ABSTRACT



This research project was activated by some intriguing
results of earlier work on the overlap among document
representations. In that earlier study, one representation
used in the INSPEC data base proved to perform unexpectedly
well in comparison with some other commonly used
representations, such as a controlled vocabulary or
free-text terms from the title/abstract of the document.
That representation, free-index phrases, is mainly composed
of free-text phrases selected by an indexer from the
title/abstract. The objectives of the current research
project were (1) to discover why the free-index phrases
performed as well as they did, and (2) to attempt to produce
surrogate free-index phrases automatically from the
title/abstract.

The free-index phrases in samples of INSPEC
title/abstracts were examined and the results of the
previous study were reconsidered in light of the current
project. Because most of the queries submitted to the
free-index representation in the original study were
searched with terms rather than phrases, our approach to
generating a surrogate free-index representation began with
phrases, but tested the effectiveness of their constituent
words. We began with all of the noun phrases in the
title/abstract. From these, several methods were used to
select surrogate free-index phrases. Each method was
compared statistically and empirically against the actual
free-index phrases and in all cases, the surrogates did not
perform as well. No clearcut cause for the performance of
the phrases was found. However, one viable possibility has
to do with those relatively few free-index phrases which do
not derive directly from the title/abstract of the document.
These phrases are added by indexers at INSPEC and most of
them are taken from the controlled vocabulary.

The research summarized in this document arose from some
unexpected but interesting results in earlier work on document
representations (Katzer, et al., 1982). As part of that effort
we compared the performance of seven different document
representations in a moderate-sized portion of the INSPEC data
base. One of those representations, "Free-Index Phrases,"
performed well on many key measures of retrieval performance.

Free-Index phrases, as implemented by INSPEC, is a unique
form of document representation, not duplicated in other data
bases. The current work was initiated because it performed well
in comparison with other representations and because it had not
been analyzed previously. There are two major objectives of this
research:

1. To identify the defining characteristics of free-index
   phrases, that is, the variables that discriminate between
   that representation and other document representations.

2. To develop an algorithm to produce surrogate free-index
   phrases from the titles and abstracts of INSPEC documents
   and to evaluate the performance of the surrogate phrases
   in comparison with the true phrases.

Accordingly this work is part of the literature of automatic
indexing. For at least twenty years various investigators have
attempted to find methods for representing documents that do not
require the use of human indexers but do perform at least as well
as humanly derived index terms. As a representation, free-index
phrases have many of the desirable characteristics. They are
derived primarily from the title and abstract of a document, they
are composed of relatively few words, and they perform at least as
well as any other document representation in the INSPEC data base.
If we are successful in finding an algorithm to generate surrogate
free-index phrases, we will have found an effective and efficient
document representation which would warrant further serious
consideration.

To put the current research into context, a brief review of
the experimental parameters and the results of the earlier study
needs to be presented. The major portion of this document then
summarizes our efforts with regard to the two major objectives
noted above.






THE STUDY OF DOCUMENT OVERLAP

The overlap study had as its primary objective the comparison
of seven different document representations in terms of
performance (recall and precision) and overlap (proportion of
documents retrieved that are identical). About 12,000 records
from the 1979 INSPEC data base were used. Each record was
composed of a bibliographic citation, an English language abstract
of about 50-75 words, and two sets of index terms (see Appendix
A). Eighty-four queries from 69 users were searched on this data
base by experienced and trained search intermediaries. Each query
was searched separately seven times, using each of the seven
representations in turn. The users were then given a merged
listing of the retrieved documents and asked to judge the
relevance of each document. The research design enabled us to
determine the effectiveness of each representation and the degree
of overlap for each pair of representations. The seven
representations are briefly defined in Table 1.

The criterion variables were recall, precision and overlap.
The recall ratio used has as its denominator the number of
relevant documents retrieved by all seven representations.
Relevance was determined by the requestor using a scale which
ranged from one to four. For some analyses a "strict" definition
of relevance was used: only those judged "1" were included. For
other analyses a broader definition was employed: those documents
rated either "1" or "2" were accepted.
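
The recall ratio and the two relevance definitions can be
illustrated with a short sketch. Only the pooled denominator and
the strict/broad cutoffs come from the text; the function names,
document identifiers and ratings below are hypothetical and are not
data from the study.

```python
def relevant_set(judgments, strict=True):
    """Return ids of documents counted as relevant.

    judgments maps document id -> requester rating on the 1-4 scale.
    Strict relevance keeps only rating 1; broad keeps ratings 1 and 2.
    """
    cutoff = 1 if strict else 2
    return {doc for doc, rating in judgments.items() if rating <= cutoff}


def pooled_recall(retrieved_by_rep, judgments, strict=True):
    """Recall per representation, using as denominator the relevant
    documents retrieved by all representations taken together."""
    relevant = relevant_set(judgments, strict)
    pool = set().union(*retrieved_by_rep.values()) & relevant
    if not pool:
        return {rep: 0.0 for rep in retrieved_by_rep}
    return {rep: len(docs & pool) / len(pool)
            for rep, docs in retrieved_by_rep.items()}


# Hypothetical data for one query (not taken from the study).
judgments = {"d1": 1, "d2": 2, "d3": 4, "d4": 1, "d5": 3}
retrieved = {"II": {"d1", "d2", "d3"}, "TA": {"d2", "d4", "d5"}}

strict_recall = pooled_recall(retrieved, judgments, strict=True)
broad_recall = pooled_recall(retrieved, judgments, strict=False)
```

Because the denominator is the pool of relevant documents found by
any representation, the resulting figures are relative recall
values rather than recall against the whole data base.
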

Table 1

Seven Document Representations Used in Overlap Study

Abbreviation   Description

II             Free-Index Phrases: Phrases selected by
               an indexer; most phrases were taken from
               the title and/or abstract, retaining the
               author's original words.

TT             Title Words: Every non-trivial word
               in the title of the document.

AA             Abstract Words: Every non-trivial word
               in the abstract of the document.

DD             Descriptor Terms: Controlled vocabulary
               terms selected by an indexer from the
               INSPEC thesaurus.

TA             Title-Abstract Words: Every non-trivial
               word in the title or abstract. A compound
               representation of "uncontrolled" words;
               TA equals the combination of TT plus AA.

DI             Indexer Selected Terms: A compound
               representation made up of DD plus II.

ST             Stemmed Free-Text Terms: ST was produced
               by automatically removing the suffixes
               from the TA representation.



A complete analysis of the results can be found elsewhere
(Katzer, et al., 1982). A brief summary of those results needs to
be discussed here.

In the table below, the recall and precision results are
aggregated into a single harmonic mean using the approach proposed
by van Rijsbergen (1979). Because the search intermediaries were
instructed to conduct "high-recall" searches, it is important to
consider the combined measure at several levels: Part A of Table
2 weights precision twice as important as recall in the combined
measure. Part D weights recall five times as important as
precision.
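
A minimal sketch of such a weighted harmonic mean follows. The
report does not print the formula it used, so the form below (the
complement of van Rijsbergen's E measure, often written F-beta) and
the mapping of Parts A through D to beta values are assumptions.
The precision and recall figures are hypothetical.

```python
def weighted_h_mean(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta is the importance attached to recall relative to precision:
    beta < 1 favours precision, beta > 1 favours recall.
    The result lies between 0 and 1.
    """
    if precision == 0.0 or recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Assumed mapping of the four parts of the table to beta values.
PARTS = {"A": 0.5, "B": 1.0, "C": 2.0, "D": 5.0}

# Hypothetical precision/recall pair, not taken from the study.
p, r = 0.40, 0.30
scores = {part: weighted_h_mean(p, r, beta) for part, beta in PARTS.items()}
```

With precision higher than recall, as in this invented example, the
score falls as the weighting moves from Part A toward Part D,
because recall counts for more at each step.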



Table 2

Combined Recall/Precision Results for Free-Index Phrases

                                    Strict Relevance                 Broad Relevance
Weightings                        h-Mean(1)  Rank(2)  %>TA(3)     h-Mean(1)  Rank(2)  %>TA(3)

A: Precision = twice recall         .226        2       06.7        .369        1       16.0
B: Precision = recall               .260   [illegible]  02.8        .343        1    [illegible]
C: Recall = twice precision         .307        2      -01.6        .320        1       06.7
D: Recall = five times precision    .339        2      -04.8        .309        1    [illegible]


1 The "harmonic mean" has been scaled from a low of zero
  to a high of one.

2 The rank reflects the performance of Free-Index Phrases
  relative to the other six representations. A rank of
  1 indicates that the representation had the highest
  performance level.

3 Because most efforts at automatic indexing begin with
  words occurring in the title and abstract, it is
  interesting to compare how much better (or worse)
  free-index phrases performed relative to the TA
  representation. It is particularly interesting because
  most of the II representation derives from TA.

Several points seem apparent from Table 2. First, the
free-index phrase representation performed quite well relative to
the other six representations (though in absolute terms none of
them performed outstandingly). Even when precision was weighted
twice as important as recall (Part A), II's performance remained
high; this is noteworthy because the intermediaries were
instructed to perform high-recall searches. Second, free-index
phrases perform better when a broader definition of relevance is
employed.

Clearly the difference between the II representation and the
TA representation is slight and none of the differences are
greater than that which could have been caused by chance. Thus,
in terms of just recall and precision, there are no grounds for
pursuing free-index phrases because it is much more
straightforward to attempt automatic indexing using words from the
title and abstract.

It is when we considered the relationship among the seven
representations (one indicator of overlap) that the potential of
the free-index phrases became more evident. Two related measures
of that relationship are considered here. The first asks which of
the seven representations retrieves the greatest number of
relevant documents; this first measure is simply recall and is
given here to provide a context for the second measure. If
relevance is defined broadly, then the II representation
(free-index phrases) contributed the most with a recall of .306.
The second highest representation was non-trivial words from the
abstract (.283). If relevance is defined so that only those
judged as "1" were included (strict definition), then
title/abstract words performed best (.369) and free-index phrases
were second (.348).

Thus, if free-index phrases were the sole document
representation in this data base, they would still retrieve a
large proportion of the relevant documents. This would be
understandable if II were composed of as many different terms as
the title/abstract vocabulary. But as we shall see later, II does
not have these attributes.

The second measure considered each representation in terms of
the number of relevant documents it contributed after the other
six representations had retrieved all they could. Here,
regardless of the definition of relevance (strict or broad),
free-index phrases contributed the greatest number of previously
unretrieved relevant documents (9.5% - [illegible]). In contrast,
the title/abstract representation contributed between 6.5% and
[illegible] unique relevant documents.
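
This second measure amounts to a set difference, as the following
sketch illustrates. The retrieval sets are hypothetical, and the
denominator used for the share (the pool of all relevant documents
retrieved) is an assumption, since the report does not state how
its percentages were computed.

```python
def unique_contributions(relevant_retrieved):
    """For each representation, return the relevant documents that no
    other representation retrieved.

    relevant_retrieved maps representation code -> set of relevant
    document ids retrieved through that representation.
    """
    unique = {}
    for rep, docs in relevant_retrieved.items():
        others = set()
        for other_rep, other_docs in relevant_retrieved.items():
            if other_rep != rep:
                others |= other_docs
        unique[rep] = docs - others
    return unique


# Hypothetical retrieval sets (not data from the study).
sets = {
    "II": {"d1", "d2", "d7"},
    "TA": {"d2", "d3"},
    "DD": {"d3", "d4"},
}
unique = unique_contributions(sets)
pool_size = len(set().union(*sets.values()))
share = {rep: len(docs) / pool_size for rep, docs in unique.items()}
```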

Clearly, free-index phrases contribute relevant documents to
the retrieved output, and this is true when II is the only
representation or when it is one of several. Also, it does so
relatively efficiently in terms of storage space (II has a smaller
vocabulary than title/abstract words) and without excessive loss
in terms of precision of retrieval (see Part A of Table 2). For
all of these reasons, we believe free-index phrases as implemented
by INSPEC ought to be subject to more intensive scrutiny.

CHARACTERISTICS OF FREE-INDEX PHRASES

Selection of Free-Index Phrases: Indexers at INSPEC choose
phrases primarily from the title and abstract of the document. As
such, the phrases consist of the author's own words, suggesting a
high degree of specificity for the representation. Free-index
phrases are intended to be "complete in themselves" and are not
meant to supplement the controlled vocabulary (descriptors). The
purpose of the free-index phrases is to provide a basis for
searching by the user, and the aim is to include all significant
concepts which could reasonably form the subject of a highly
detailed literature search.

This approach to free-index phrases appears to be unique and
cannot be considered comparable to representations with similar
names implemented in other data bases. For example, PsycINFO
(Psychological Abstracts) contains an "identifier" field which is
intended to supplement the information contained in the controlled
vocabulary by specifying characteristics of the research design or
the subjects used; these identifiers are not intended to
represent the major significant concepts in the document. In the
ERIC database (Educational Resources Information Center), the
identifier field is also designed to supplement the controlled
vocabulary. Identifiers here contain all proper names as well as
terms which may at some later time be incorporated in future
versions of the ERIC thesaurus.

At INSPEC, indexers are assigned documents on the basis of
their subject specialization. Indexers receive the full text of
the document along with its abstract. If no abstract is available
or if the existing abstract is too brief, the indexer prepares one
that will be more suitable. The indexer is charged with selecting
words and phrases which "express the significant concepts both
explicit and implicit" dealt with in the document (INSPEC 1970).
The terms are not selected from an authority list or thesaurus as
in the case of the controlled index terms (descriptors), but are
freely chosen by the indexers. The form of the phrases is not
standardized since this representation is regarded as free
(natural) language.

Indexing procedures are not so much a function of the
official rules as they are of what the indexers actually do in
practice. The same indexer assigns all document representations
(free-index phrases, descriptors, etc.) for a given document.
Most of the free-index phrases are selected by underlining key
phrases in the title or abstract. Then, for concepts treated
implicitly in the title or abstract, the indexer creates and adds
additional phrases. These implicit phrases form a small portion
of all II phrases assigned to a document. A manual examination of
39 documents selected at random found that only seven of the 192
free-index phrases did not appear in the title or abstract of the
document; on average less than one implicit phrase per document.

Because the implicit phrases (though few in number) may have
had a major influence on the performance of the II representation,
indexers at INSPEC were interviewed to attempt to determine when
implicit phrases would be added. At the time of the interview,
INSPEC had implemented a revised policy regarding free-index
phrases, which was intended to make the phrases a more exhaustive
representation than they had been previously. This is evident from
an increase in the number of phrases assigned to each document.
Originally, there was an average of five phrases per document
(Waldstein, 1981), while under the new policy the average rose to
over seven per document. Furthermore, indexers estimated an
average of two implicit phrases per document in contrast with less
than one previously. This change in indexing policy at INSPEC
made it difficult to learn about the indexing practice which was
in effect when the 1979 test collection was originally prepared.

Analysis of Free-Index Phrases: To discover some of the
statistical and phrasal properties of the free-index
representation, several investigations were conducted on small
random samples of the INSPEC data base.

A test collection of 994 documents (citations plus abstract)
was created and various statistical counts were made of the major
representations employed in the overlap study. Each of those
representations was analyzed in several forms. For example, the
free-index phrases were studied as intact phrases, as words from
the phrases, and as word stems from the phrases. The results of
this analysis are presented in Table 3. The final entries in that
table contain statistical counts of the noun phrases found (by a
parser) in the title and abstract of the document. This NP
"representation" was not used in the overlap study. It is
included here because noun phrases will form the basis of our
efforts toward creating surrogate free-index phrases
automatically.

Throughout the analysis, it will be important to compare the
II representation with that of the TA. Given that free-index
phrases derive for the most part from the title/abstracts of
documents, what can account for the results obtained in the
overlap study? Both representations performed about equally well
in terms of recall and precision (though there are many fewer II
entries per document than TA terms), but the II representation
outperformed title/abstracts in terms of one important measure of
document overlap: the proportion of unique relevant documents
retrieved beyond those retrieved by the other representations.

Of particular concern was the level of specificity and
exhaustivity of the II representations (phrases, words, and word
stems) in comparison with the other representations. If the
specificity of an index term is measured as some inverse function
of the number of documents to which the term is assigned
("postings"), the last column of Table 3 suggests that the II
representation has a high level of specificity. If the two word
forms of the II representation are averaged and compared with the

Table 3

Statistical Characteristics of Selected Document Representations*

                       Total        Average      Number of    Average      Total        Average
                       Number       Number of    Unique       Unique       Postings     Postings/
Representation         of Terms     Terms/Doc.   Terms        Terms/Doc.                Term

AA: Abstract
    Words              [illegible]  [illegible]  [illegible]  39.84        39848        4.84
    Stems              [illegible]  [illegible]  [illegible]  48.41        38416        7.37

TT: Title
    Words              7662         7.66         2690         7.42         7419         2.75
    Stems              7662         7.66         2077         7.39         7398         3.56

TA: Title/Abstract
    Words              65702        65.70        8760         42.83        42837        4.89
    Stems              65702        65.70        5558         41.01        41011        7.37

DD: Descriptors
    Phrases            2509         2.50         907          2.48         2482         2.73
    Words              5054         5.05         858          4.76         [illegible]  [illegible]
    Word Stems         5054         5.05         720          4.68         4683         6.50

II: Free Index
    Phrases            4914         4.91         4311         4.89         4891         1.13
    Words              10358        10.35        3343         9.56         9568         2.86
    Word Stems         10358        10.35        2418         9.36         9367         3.87

NP: Noun Phrases
    Phrases            17349        17.34        12068        16.20        16176        1.34
    Words              29582        29.58        6960         23.94        23942        3.43
    Word Stems         29582        29.58        4606         22.74        22748        4.93

*Based on a random sample of 994 Documents

other five averages, we see that free-index phrases have a high
level of specificity (3.36) (second only to TT) while the
title/abstract representation has the lowest (6.13). Thus, II is
45% more specific than TA.

The exhaustivity of an index term may be assessed by some
direct function of the number of unique terms, either in the
entire data base or per document (see columns #3 and #4 in Table
3). Here the free-index phrases perform differently. If a high
level of exhaustivity in indexing is needed, then II would appear
not to be a good candidate, because it is [illegible] and 77% less
exhaustive than TA.
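
The specificity and exhaustivity comparisons can be retraced from
the II and TA rows of Table 3. The sketch below assumes that "the
two word forms" means the Words and Word Stems rows and that each
comparison is the difference of the two averages expressed as a
percentage of the TA average; the helper names are hypothetical.
Under those assumptions it reproduces the 3.36, 6.13, 45% and 77%
figures. One of the two exhaustivity percentages is not legible in
the source, so the value computed for the unique-terms column is
only what this method yields, not a confirmed figure.

```python
# (unique terms, avg unique terms/doc, avg postings/term) per word form,
# copied from the II and TA rows of Table 3.
TABLE3 = {
    "II": {"words": (3343, 9.56, 2.86), "stems": (2418, 9.36, 3.87)},
    "TA": {"words": (8760, 42.83, 4.89), "stems": (5558, 41.01, 7.37)},
}


def mean_over_forms(rep, column):
    """Average one column over the Words and Word Stems rows."""
    forms = TABLE3[rep]
    return (forms["words"][column] + forms["stems"][column]) / 2


def percent_lower(rep, baseline, column):
    """How much lower rep's average is than the baseline's, in percent."""
    a = mean_over_forms(rep, column)
    b = mean_over_forms(baseline, column)
    return 100 * (b - a) / b


ii_postings = mean_over_forms("II", 2)               # about 3.36
ta_postings = mean_over_forms("TA", 2)               # 6.13
more_specific = percent_lower("II", "TA", 2)         # about 45
fewer_terms_per_doc = percent_lower("II", "TA", 1)   # about 77
fewer_unique_terms = percent_lower("II", "TA", 0)    # unconfirmed by source
```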

Based on these results, one would predict that the free-index
representation (in comparison with words from the title/abstract)
would perform rather well on precision, but rather less well in
terms of recall. Nevertheless, as noted earlier, II did not
perform significantly differently from TA in terms of either
recall or precision. If high specificity is a plausible
explanation for the precision results, what could account for the
recall performance? Clearly, a more detailed examination of
free-index phrases is needed.

One approach is to consider other properties of the phrases.
Waldstein (1981) suggested that all subject descriptors (whether
controlled or uncontrolled) take the form of noun phrases. In
fact he showed that 90.6% of the free-index phrases in INSPEC are
derived from noun phrases. This, of course, is not a new notion.
As far back as 1968, Armitage and Lynch suggested the use of an
automatic parser to locate simple noun phrases in titles for
indexing purposes. A decade later, Borko (1978) suggested the
possibility of using a set of automatic transforms to "make all
subject headings consist of nouns, gerunds or noun phrases".

Waldstein's work was helpful at a gross level, but did not
provide the kind of detailed analysis needed. A thorough
examination of the 192 free-index phrases that occurred in a
random sample of 39 documents revealed that

-- 71.3% (137) were unique noun phrases, occurring only once
   in the document's title/abstract.

-- 18.8% (36) were noun phrases that occurred more than once
   in the title/abstract.

-- 6.3% (12) were noun phrases that did not occur in the
   title/abstract.

-- 3.6% (7) were index phrases that did occur in the title/
   abstract, but were not noun phrases.

Here a noun phrase was defined as (i) an optional article,
(ii) zero or more adjectives, and (iii) one or more nouns, in
that order. Several conclusions derive from this analysis. First
of all, this small scale study corroborates Waldstein's earlier
work: he found that over 90% of the free-index phrases were noun
phrases; the figure here is slightly higher (96.4%). The
difference between the two may be attributed to sampling error or
to the differences in the procedures used. Waldstein used an
automatic parser with a slightly different definition of a noun
phrase; this study did the parsing manually.
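
The noun phrase definition used here (optional article, zero or
more adjectives, one or more nouns, in that order) can be written
as a pattern over part-of-speech tags. The sketch below is only an
illustration: the one-letter tag codes and the hand-tagged sentence
are hypothetical, the study itself parsed manually, and a working
system would need a tagger or parser to supply the tags.

```python
import re

# Hypothetical one-letter tag codes: T = article, A = adjective,
# N = noun, X = anything else. One character per token keeps string
# offsets aligned with token positions.
NP_PATTERN = re.compile(r"T?A*N+")


def noun_phrases(tagged_tokens):
    """Return noun phrases from a list of (word, tag) pairs.

    A phrase is an optional article, zero or more adjectives and one
    or more nouns, in that order.
    """
    tags = "".join(tag for _, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(tags):
        words = [w for w, _ in tagged_tokens[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


# Hand-tagged example sentence (invented for illustration).
sentence = [
    ("the", "T"), ("automatic", "A"), ("indexing", "N"),
    ("of", "X"), ("technical", "A"), ("abstracts", "N"),
    ("requires", "X"), ("a", "T"), ("parser", "N"),
]
found = noun_phrases(sentence)
```
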

A corollary to this first point is that any approach to
generating surrogate free-index phrases from noun phrases will
miss some small percentage of index phrases which are not noun
phrases.

Secondly, and perhaps even more importantly, there is the presence
of free-index phrases which were not derived from the
title/abstract of a document. Though few in number, it is
possible that these "implicit" phrases explain why the II
representation performed as well as it did in terms of recall,
especially in comparison with the TA representation. If this
conjecture is correct, then most straightforward approaches to
producing surrogate free-index phrases from the title and abstract
will miss key concepts. More involved methods making use of
thesauri or other non-document sources of subject knowledge will
have to be used. For example, the system being developed by
Harding (1982) fragments and truncates all currently assigned
free-index phrases and enters them with conceptual links and
weights into a vocabulary file. This file is then used to assign
free-index terms automatically on a statistical basis. This
approach requires a pre-existing set of free-index phrases and
would also require indexer-generated phrases to be added to the
authority file in order to accommodate changes and growth in the
subject matter.

It remains to be seen if surrogate free-index phrases can be
produced from the title/abstract of a document. The evidence so
far suggests that the surrogate phrases be selected from
automatically identified noun phrases. The task remaining is to
identify the procedure for reducing the number of noun phrases to
a more cost/effective subset. Such an approach has the advantage
of simplicity and does not require outside knowledge sources or
the input of human indexers. Of course, if many of the most
effective free-index phrases are derived from either the implicit
phrases or from title/abstract words which are not noun phrases,
then this approach will fail.

Use of Free-Index Phrases: Retrieval results depend not only on
the indexing procedure, but also on the behavior of the searcher.
In the overlap study, each query was searched by a trained
intermediary who was automatically restricted to one of the seven
document representations. The searchers and the representations
were balanced in a replicated Latin Square design. For the
purpose of that study we were able to determine that searcher
behavior differed across the queries, though the statistical
analysis could not determine if there was a significant
searcher-representation interaction. Such an interaction would
indicate that the behavior of the searchers and their knowledge of
the individual representations were important components in the
performance of the free-index phrases as compared with the title/
abstract representation.

Since this information was not available from the overlap
study, the present investigation sought other indicators of a
searcher-representation interaction. The original searchers were
interviewed (several years after they completed their work), their
search logs were analyzed, and several artificial searches were
created and processed against the original data base.

Six of the seven original searchers were available to be
interviewed. An open-ended structured questionnaire was developed
and pretested (see Appendix B). The questions attempted to
discover how familiar each searcher was with various data bases
and with the seven document representations, with particular
emphasis on descriptors, title/abstract terms, and free-index
phrases. There was also a series of questions asking if the
searchers could suggest any reason for the obtained performance of
the II representation. To help refresh the searchers' memories,
each was provided with an actual query that they had searched on
the II representation and the log they produced as they refined
and searched the data base.

The interviews revealed no clear-cut bias for or against any
particular representation, though it did appear that none of the
searchers was very comfortable with the free-index phrases. They
found the phrases to be very specific to the subject area of the
data base, an area with which many of the searchers were
relatively unfamiliar. Most of them came from an environment
which made heavy, if not exclusive, use of the ERIC data base. As
a result, the searchers were not very familiar with the INSPEC
indexing policy (even after a relatively lengthy training period).
The interviews revealed that the searchers tended to view the
free-index phrases and the descriptor (DD) phrases as mutually
exclusive and they sometimes went to the trouble of excluding from
their searches of the II representation those terms found in the
printed INSPEC thesaurus.


In terms of explaining the recall/precision results of the
free-index phrases, the searchers suggested that the vocabulary of
that representation appeared to be both exhaustive and specific,
thereby combining the best aspects of both the title/abstract and
the controlled vocabulary (DD) representations. According to the
searchers, the free-index phrases have the advantage of using
terminology that is in current use and which specifically applies
to each document. Since they treated the queries as specific
search requests, they thought there was a strong fit between the
query and the representation.

Overall, there is little evidence from the interviews of a
searcher-representation interaction, though the interviews did
confirm our belief that the free-index phrase representation was
searched, for the most part, on a word basis. It was possible for
the searchers to use both phrases and words because the inverted
file contained both types of items, but an examination of the 84
queries searched under the II representation found that all but
twelve were searched using combinations of individual words.
Thus, in practice the free-index representation is selected by
indexers as phrases and used by searchers as words. Selecting
pre-coordinated phrases and searching with post-coordinated words
from those phrases may be essential to any attempt to understand
the performance of the free-index phrases.
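
The distinction between pre-coordinated phrases and post-coordinated
words can be made concrete with a toy inverted file that holds both
kinds of entries. This is a sketch only: the structure of the actual
INSPEC inverted file is not described in this report, and the
documents, phrases and function names below are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(doc_phrases):
    """Index every free-index phrase both as a whole phrase and as its
    individual words. doc_phrases maps document id -> list of phrases."""
    index = defaultdict(set)
    for doc_id, phrases in doc_phrases.items():
        for phrase in phrases:
            normalized = phrase.lower()
            index[normalized].add(doc_id)      # pre-coordinated entry
            for word in normalized.split():
                index[word].add(doc_id)        # single-word entry
    return index


def search_all_words(index, words):
    """Post-coordinated search: documents indexed under every word."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents and phrases.
docs = {
    "d1": ["laser annealing", "silicon wafers"],
    "d2": ["annealing of silicon"],
    "d3": ["laser diodes"],
}
index = build_inverted_file(docs)
phrase_hits = index.get("laser annealing", set())
word_hits = search_all_words(index, ["silicon", "annealing"])
```

In this invented example the word search retrieves d1 even though
"silicon" and "annealing" come from two different phrases assigned
to it, which is the kind of broadening that word-level searching of
a phrase representation allows.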




The interviews did lead to an examination of the search logs
to determine if important terms had been dropped from the TA
searches but remained in the II searches. Words in the TA
representation tend to have higher postings than those in the II
representation (see Table 3). The question here is whether words
initially included in both sets of searches (TA and II) were later
excluded from one because the postings were either too high
(presumably for the TA searches) or too low (for the II searches).
Evidence of such behavior would indicate that the differences in
the postings caused the searchers to act differently with the two
representations, a clue for a searcher-representation
interaction.

To answer this question, the 84 TA search logs were compared
with the 84 II logs. This comparison yielded, for each query, a
list of terms that were used under both representations (Boolean
operators were ignored, making the results less realistic).
These terms were followed throughout the log to see if any were
eliminated. In all, there were only 22 instances in which a term
was dropped from the TA search but was retained in the II search.
For 18 of these terms, the number of postings for the TA
representation was higher than that for the II representation.
This is supportive of the hypothesis that searchers treated the II
representation differently than the TA representation, though
the size of this interaction is questionable because only 18
search terms (out of all terms used in the 84 queries) are
involved.
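
The log comparison can be sketched as follows. The log format (one
set of terms per search step) and the rule that a term counts as
"dropped" when it is absent from the final step are assumptions made
for illustration; the report does not describe how the logs were
coded. All terms and postings counts are invented.

```python
def dropped_from_ta_only(ta_log, ii_log):
    """Terms used in both logs that were dropped during the TA search
    but are still present at the end of the II search.

    Each log is a list of term sets, one per search step, in order.
    Boolean operators are ignored.
    """
    if not ta_log or not ii_log:
        return set()
    used_in_ta = set().union(*ta_log)
    used_in_ii = set().union(*ii_log)
    shared = used_in_ta & used_in_ii
    return {t for t in shared if t not in ta_log[-1] and t in ii_log[-1]}


def count_higher_ta_postings(terms, ta_postings, ii_postings):
    """Count terms whose TA postings exceed their II postings."""
    return sum(1 for t in terms if ta_postings[t] > ii_postings[t])


# Hypothetical logs and postings for a single query.
ta_log = [{"laser", "annealing", "silicon"}, {"laser", "annealing"}]
ii_log = [{"laser", "silicon"}, {"laser", "silicon", "annealing"}]
dropped = dropped_from_ta_only(ta_log, ii_log)
higher = count_higher_ta_postings(dropped, {"silicon": 412}, {"silicon": 57})
```
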

Implicit Free-Index Phrases: The remaining possibility is that
the II representation is inherently superior to the TA. Since the
former is derived from the latter, any investigation along these
lines must focus on the implicit phrases, those not found in the
title/abstract of the document.

One way to estimate the effect of the implicit free-index
phrases is to test them in a simulated retrieval experiment.
Central to such a study is a comparison of the results of a search
performed using the TA representation with the results of an
identical search using the II representation. Unfortunately, the
existing data (searches and retrievals) from the overlap study are
based on different searchers using different search strategies on
the different representations for a single query.

To obtain a single search for each of the 84 queries, the II
searches were standardized. This procedure involved inserting the
(W) operator to specify that search words have to be adjacent and
in the designated order. Thus, the (W) operator permitted the
searching of phrases within the title/abstract. The resulting
standardized searches were then resubmitted to the document
collection using the TA representation. Since the searches were
now identical, any document retrieved by the II search but not by
the new TA search could be attributed to the implicit II phrases.
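
The attribution step amounts to a set difference between the two
retrieval lists for the same standardized search. A minimal sketch
follows; the function name and the document identifiers are
hypothetical and serve only as illustration.

```python
def implicit_only(ii_hits, ta_hits):
    """Documents retrieved by the II search but not by the
    identical (standardized) TA search."""
    return set(ii_hits) - set(ta_hits)


# Hypothetical retrieval lists for one standardized query.
ii_hits = ["d1", "d2", "d3", "d4"]
ta_hits = ["d1", "d3"]

# d2 and d4 would be attributed to implicit II phrases.
attributed = implicit_only(ii_hits, ta_hits)
```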

The results showed that of the documents retrieved by II,
implicit phrases were responsible for 10% of the highly relevant
(28 out of 283) and 12.4% of the broadly relevant (65 out of 526).
These percentages, though small, are certainly not insignificant
(particularly in view of the size of the differences in Table 2),
emphasizing the importance of the implicit free-index phrases
and the difficulty of generating high-performing surrogate phrases
automatically from the title/abstract of a document.

The 28 highly relevant documents were further analyzed to
determine which phrases were actually responsible for their
retrieval. The documents were manually examined and the
retrieving phrases can be broadly classified according to their
origin as follows:

   - 23 documents had terms in the free-index phrases that
     did not occur in the title/abstract; these phrases were
     responsible for the documents' retrieval.

   - five documents had terms in the title/abstract that
     differed syntactically from the retrieving II terms;
     differences included variations in word order,
     word endings and the use of abbreviations or hyphens.

The five documents in the second class above had implicit
phrases which could be derived from the contents of the
title/abstract using rules similar to those used by indexers. For
the 23 documents in the first class, the implicit phrases were not
to be found in any form in the title/abstract. The majority (19)
of these phrases were taken from the controlled vocabulary,
duplicating what was found in the descriptor (DD) representation.
Bearing in mind that the free-index phrases are meant to "stand
alone" as a representation, it is reasonable to expect indexers to
enrich that representation with descriptor terms if those concepts
are not contained in the title/abstract.

The preceding examination of implicit phrases is based on
queries and the documents they retrieved. The question remains:
to what extent do the results generalize to documents in general?
Using a small sample of 39 documents selected at random from the
data base, twelve implicit free-index phrases were found. Of
these,

   - nine phrases (75%) were not found in the title/abstract
     of the document; six of the nine phrases are exact
     duplicates of the descriptor phrases.

   - three phrases (25%) were found in some non-identical
     form (e.g. abbreviation or change in word order) in
     the title/abstract of the document.



Thus, there is some indication that implicit free-index phrases
were instrumental in obtaining the results of the overlap study
and are in evidence throughout the data base.



Summary: The results of our analyses of the free-index phrases
are not conclusive. There are, however, some suggestions which do
affect (1) the manner in which we proceed in our effort to generate
surrogate phrases automatically and (2) our expectations of what
can be achieved from the title/abstract of the document.
Specifically,

   - free-index phrases have a high degree of specificity;
     this is true for the entire phrase, for words from the
     phrase and for word stems. A high level of specificity
     ought to be expected from the manner in which INSPEC
     indexers select most of them from the title/abstract of
     the document. High specificity, intrinsic to the
     representation, may account for the obtained levels of
     precision in the overlap study, levels comparable to
     those of the TA representation.

   - high levels of exhaustivity are not characteristic of
     the free-index representation. Clearly, exhaustivity
     is not responsible for levels of recall obtained for
     II that did not differ from those obtained for the TA
     representation.

   - searcher behavior suggests that the free-index phrases
     were to some extent treated differently from
     title/abstract terms. This interaction may account
     for some of what was found in terms of the recall of
     the II and the TA representations.

   - it is the presence of implicit phrases, especially in
     relevant documents, that may be most central to II's
     superior performance in comparison with that of TA.



The analyses also revealed that searchers used free-index words in
their interactions with the data base. One reasonable method for
approaching the automatic generation of a surrogate representation
is to begin with noun phrases in an attempt to capture the
specificity needed and the concepts contained in pre-coordinated
phrases and then do the retrieval using words from those phrases.
This will allow for maximum flexibility and increase the postings
of each term. The fundamental problem remaining is then to reduce
the number of phrases to some reasonable level. However, if
implicit phrases need to be added to obtain acceptable levels of
performance, then any approach which does not use knowledge aids
or indexer inputs will be limited at the outset.

AUTOMATIC GENERATION OF SURROGATE FREE-INDEX PHRASES

Overview of Approach: The search for automatic procedures for the
identification of effective and efficient document representations
from documents (or specific parts of them) has been progressing
for the past twenty to twenty-five years. Historically, two major
approaches are evident in this research: the statistical and the
linguistic. The former employs statistical criteria to select
terms during indexing. The latter utilizes the syntactic and/or
semantic features of the document to generate index terms.

The simplest and earliest statistical scheme for automatic
indexing was proposed by Luhn (1958). He evaluated a term's
indexing potential for a document on the basis of its frequency of
occurrence in the document. Following this there is the vast work
performed by Sparck Jones (1972, 1973), Salton and his co-workers
(1972, 1973, 1975, 1976, 1981) and others such as Robertson et al.
(1981). In these studies, the measures of a term's indexing
potential were functions of the term's frequency characteristics
both within the document and within the data base. The results of
numerous investigations into the relative merits of statistical
indexing methods remain equivocal. This is partly due to
differences in experimental design. Sparck Jones (1981) presents
a good discussion of these differences. Further, it is still
uncertain how the results will generalize when implemented
on operational databases.

Initially, the expectations regarding the practical
utilization of linguistic approaches were optimistic. This was
replaced later by a widespread pessimism, primarily due to the
failure of such approaches in machine translation (Damerau, 1970).
However, in recent years there is evident a renewed interest in
the application of these techniques to automatic indexing. The
linguistic approaches to automatic indexing are slightly more
diverse than the statistical approaches. The indexing system
developed by Sager (1981) represents the highest level of
linguistic sophistication. The system focuses on deriving a
tabular representation from the text using syntactic strategies.
These are used to answer queries as well as reconstruct the
original text. At a slightly lower level of sophistication is the
PHRASE system (Earl 1972, 1973), which syntactically reduces a text
to its component phrases and selects from them, using a dictionary
to specify acceptable phrase formats. Dillon and Gray's FASIT
(1983) and Klingbiel's MAI (1973a, 1973b) systems attempt the same
objective. A slightly different approach is taken by Steinacker
(1973, 1974), who used statistical criteria to recognize
significant phrases in a text. The linguistic systems mentioned
above use the document text or abstract as the unit from which to
derive indexing phrases. Other work uses linguistic methods on
smaller units such as the document titles. The Multilevel
Substring Analysis procedure as described by Garfield (1981) is
one example. The KWPSI system derives four different substrings
from each title by parsing; one of the substrings is a noun
phrase.

A common point evident from most of the linguistic approaches
is the importance of the noun phrase. Most of these systems
directly or indirectly identify noun phrases from the abstract or
title as part of their automatic indexing procedure. In addition
to these indications of the importance of noun phrases, Waldstein
(1981) found that in the INSPEC data base most of the phrases
selected for indexing were noun phrases.

It is possible to short-cut the process by beginning with
noun phrases already selected for indexing. Such an approach is
being developed at INSPEC by Harding (1982). His method analyzes
the existing free-index phrases in the data base. Each phrase is
then broken into its component words, which are then recombined to
produce all possible combinations (singlets, doublets, etc.).
Data base frequencies of these combinations and the INSPEC
thesaurus are then used to eliminate the unimportant combinations.
The resultant combinations (phrases) are stored in a dictionary
which is used to select or reject phrases from the document.
Harding concluded that the automatically generated phrases were
quite different from the manually selected ones. Furthermore,
Harding does not report the retrieval effectiveness of the
surrogate free-index phrases produced in this manner.
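
The recombination step can be illustrated with a short sketch. It
assumes that a "combination" keeps the original word order and that
a phrase is split on whitespace; neither detail is stated above, and
the function name is hypothetical. The frequency and thesaurus
filtering that follows in Harding's method is not shown.

```python
from itertools import combinations


def word_combinations(phrase):
    """All singlets, doublets, etc. of the words of a phrase,
    keeping the words in their original order (an assumption)."""
    words = phrase.split()
    return [
        " ".join(combo)
        for size in range(1, len(words) + 1)
        for combo in combinations(words, size)
    ]


combos = word_combinations("data format conversion")
```

A phrase of n distinct words yields 2^n - 1 combinations, which is
why a later elimination step is needed.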

Another approach to the identification of phrases was
employed by Salton and Wong (1976). Their work appears to have
been motivated not so much by the theoretical value of noun
phrases as by the empirical finding that index terms with high
document frequencies (i.e. postings) are not effective for
retrieval. To improve the value of these high frequency terms
they can be combined with other terms, forming a phrase. Salton
and Wong use a positional definition of a phrase: all pairs of
word stems no more than one intervening word apart were taken as
phrases. These phrases were then tested on three experimental
document collections. The results indicate that adding phrases
increased retrieval performance; phrases composed of a low document
frequency term paired with a medium or high frequency term were
particularly effective.
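
A minimal sketch of such a positional definition is given below. It
assumes the text has already been reduced to an ordered list of
stems; the function name and the `max_gap` parameter are
hypothetical, and the frequency-based pairing of terms is not shown.

```python
def positional_pairs(stems, max_gap=1):
    """All ordered pairs of stems separated by at most
    `max_gap` intervening words."""
    pairs = []
    for i, left in enumerate(stems):
        # Partners lie 1 .. max_gap + 1 positions to the right.
        last = min(i + max_gap + 2, len(stems))
        for j in range(i + 1, last):
            pairs.append((left, stems[j]))
    return pairs


pairs = positional_pairs(["automat", "index", "noun", "phrase"])
```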

In contrast, the approach taken here to produce surrogate
free-index phrases does not make use of a pre-established
dictionary of phrases, nor does it use a positional criterion to
define a phrase. Our hope is to identify a general procedure
which could, in principle, be applied to data bases that do not
already contain a type of document representation similar to the
free-index phrases. Consequently, our approach must begin with
the noun phrases identified from the title/abstract of each
document. Then a variety of statistical criteria are considered
to see if it is possible to select from the noun phrases a subset
which could function as free-index phrases. If statistical
methods are not able to successfully distinguish among alternative
subsets of noun phrases, then empirical methods will be employed.

Identification of Noun Phrases: The parser used was created by
Waldstein (1981) and is based on an algorithm developed by Earl
(1972). The parser works with the aid of an exceptions dictionary
which contains those words which do not uniquely belong to a
single grammatical category, but depend upon context to be
properly classified. The parser defines a simple noun phrase as
consisting of (a) an optional article, followed by (b) one or more
adjectives, followed by (c) one or more nouns. Each of the three
components is optional, except that an article cannot stand by
itself as a noun phrase. Appendix C contains an example of output
generated by the initial version of this parser.
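
The definition of a simple noun phrase can be sketched as a scan
over part-of-speech tagged tokens. This is only an illustration of
the pattern, not the Waldstein parser: the tag names ("ART", "ADJ",
"NOUN") and the assumption that tokens arrive already tagged are
hypothetical.

```python
def simple_noun_phrases(tagged):
    """Find runs of [article] adjective* noun* in a list of
    (word, tag) pairs; an article alone is not a phrase."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        if tagged[i][1] == "ART":          # (a) optional article
            i += 1
        body_start = i
        while i < n and tagged[i][1] == "ADJ":   # (b) adjectives
            i += 1
        while i < n and tagged[i][1] == "NOUN":  # (c) nouns
            i += 1
        if i > body_start:
            phrases.append(" ".join(w for w, _ in tagged[start:i]))
        elif i == start:
            i += 1  # token belongs to no phrase; move on
    return phrases


tagged = [("the", "ART"), ("automatic", "ADJ"), ("indexing", "NOUN"),
          ("of", "PREP"), ("the", "ART"), ("is", "VERB"),
          ("documents", "NOUN")]
found = simple_noun_phrases(tagged)
```

Because every component is optional, a lone adjective is accepted
as a phrase under this pattern.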

The original version of the parser was not useable without
modification. It had to be changed to accept the entire
title/abstract as input and produce as output a list of noun
phrases found therein. These modifications were relatively
straightforward. More troublesome was the difficulty in parsing
titles. The parser approaches each sentence by finding the main
verb and then identifying nouns and other parts of speech. Many
titles in INSPEC did not contain a verb, causing errors in the
identification of noun phrases. Correcting this problem
accommodated those titles without verbs but produced other errors
when working on those few titles which contained verbs. For
example, in the title "Programming Endgames with Few Pieces", the
parser treated the verb "programming" as an adjective, producing
the false noun phrase "programming endgames". Errors of this type
occurred five times in a sample of 40 documents used to test the
parser.

Because the parser output was to be used as a replacement for
indexer selected noun phrases, it was necessary to compare parser
output with that produced by people who were trained to identify
noun phrases from text. For this test the randomly selected
sample of 40 documents was parsed, producing 960 simple noun
phrases. Parsing the same documents by hand yielded 735 phrases.
Assuming that the human generated list was correct, an error
analysis of the parser output was conducted. Both errors of
commission and errors of omission were considered. The former
include all phrases produced by the parser but not by hand. The
latter include those phrases found by hand but not identified by
the parser. An analysis of both types of errors is presented in
Table 4.

Table 4

Error Analysis of Initial Parser Output*

Errors of Commission: 17.71% (170 out of 960)

   Example: (a) qualifiers being selected as noun phrases,
                such as "that there".

            (b) noun phrases with extraneous words,
                such as "systems make new approaches".

Errors of Omission: 8.71% (64 out of 735)

   Example: the noun phrase "data format conversion"
            is identified by the parser as two phrases,
            "data" and "conversion"; the word "format"
            was treated as a verb.

* Forty documents were selected, one of which did not contain an
abstract. Thus, for the purpose of testing the parser, only 39
documents were used.
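
The two error rates can be expressed as set differences between the
parser output and the hand-generated list. The sketch below assumes
each phrase occurrence can be treated as a distinct item (for
instance a (document, phrase) pair) and that both lists are
non-empty; the function name is hypothetical. The last two lines
simply recompute the percentages from the counts reported in
Table 4.

```python
def error_rates(parser_items, hand_items):
    """Return (commission %, omission %), taking the hand list
    as correct. Commission is relative to the parser output,
    omission relative to the hand list."""
    parser_set, hand_set = set(parser_items), set(hand_items)
    commission = parser_set - hand_set   # parser only
    omission = hand_set - parser_set     # hand only
    return (100.0 * len(commission) / len(parser_set),
            100.0 * len(omission) / len(hand_set))


# Recomputing the reported rates from the reported counts.
commission_pct = round(100.0 * 170 / 960, 2)
omission_pct = round(100.0 * 64 / 735, 2)
```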



To reduce the number of errors, the parser was modified to
clean up the phrases identified. Two stoplists were added to the
parser. The first eliminated single word parser-generated phrases
which were not noun phrases. These single word "phrases" included
qualifiers, single letters, and single adjectives. Also all
trivial noun phrases such as "the authors" or "this paper" were
eliminated from the parser's output. The second stoplist was used
to eliminate trivial single words (such as articles) which began
multi-word noun phrases.
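
A minimal sketch of this clean-up follows. The stoplist entries
shown are hypothetical placeholders (the actual lists are those of
Appendix D), and the order in which the filters are applied is an
assumption.

```python
# Hypothetical entries; the real stoplists are in Appendix D.
SINGLE_WORD_STOPLIST = {"that", "there", "new", "a"}
TRIVIAL_PHRASES = {"the authors", "this paper"}
LEADING_WORD_STOPLIST = {"the", "a", "an", "this"}


def clean_phrases(phrases):
    """Apply the trivial-phrase filter and both stoplists."""
    cleaned = []
    for phrase in phrases:
        words = phrase.lower().split()
        if " ".join(words) in TRIVIAL_PHRASES:
            continue
        # Second stoplist: strip trivial words that begin a
        # multi-word phrase.
        while len(words) > 1 and words[0] in LEADING_WORD_STOPLIST:
            words = words[1:]
        # First stoplist: drop single-word "phrases" that are
        # not noun phrases.
        if len(words) == 1 and words[0] in SINGLE_WORD_STOPLIST:
            continue
        if words:
            cleaned.append(" ".join(words))
    return cleaned


kept = clean_phrases(["the authors", "the data format",
                      "that", "retrieval"])
```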

These modifications dealt solely with particular types of
errors of commission. Errors of omission and the remaining errors
of commission were left unremedied because they resulted from
textual or syntactic features of the documents which were
problematic for the parser.

The original test collection of 39 documents was then
re-analyzed by the parser. The errors of omission remained
unchanged but the errors of commission were reduced from
170 to 26, yielding an error rate of under three percent. Finally
a new random sample of a? documents was passed through the parser
to determine if additional items should be added to the stoplists.
The final version of both stoplists is given in Appendix D.

Parser output for each document in the sample collections was
then compared with the free-index phrases of those documents. An
analysis of the overlap between the two sets of phrases would
provide some indication of the amount of selection needed
to reduce the larger set of noun phrases to the smaller set
of II phrases. The analysis would also estimate an upper limit on
what can be reasonably expected from limiting the search for
free-index phrases to the collection of noun phrases derived from
the title/abstract of the document.

Table 5 illustrates the results obtained when the comparison
was conducted on two random samples of documents. Comparisons
were performed on an "exact match" basis using unstemmed words in
the phrases.

Table 5

Overlap Between Noun Phrases (NP) and Free-Index Phrases (II)

                        Number of Unique   Number of    Percentage of
                        ----------------   Terms in   ------------------
                          NP       II      Common      NP in     II in
Collection               Terms    Terms                Common    Common

Words: 40 Documents       491      305       185       37.68     60.66

Words: 100 Documents      986      637       417       42.29     65.16

Phrases: 40 Documents     325      187        59       18.15     31.55

Phrases: 100 Documents    731      435       122       16.69     28.05
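
The two percentage columns follow directly from the counts. The
sketch below shows the exact-match computation on term sets and
rechecks two rows using the counts as read from the table; the
function names are hypothetical.

```python
def overlap_percentages(np_terms, ii_terms):
    """Exact-match overlap between two term vocabularies:
    (% of NP terms in common, % of II terms in common)."""
    np_set, ii_set = set(np_terms), set(ii_terms)
    common = np_set & ii_set
    return (100.0 * len(common) / len(np_set),
            100.0 * len(common) / len(ii_set))


def percentages_from_counts(n_np, n_ii, n_common):
    """Same two percentages, starting from reported counts."""
    return (round(100.0 * n_common / n_np, 2),
            round(100.0 * n_common / n_ii, 2))


words_40 = percentages_from_counts(491, 305, 185)
phrases_40 = percentages_from_counts(325, 187, 59)
```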



The goal of automatically generating phrases from the
title/abstract that are identical to the free-index phrases is
problematic. Since there is only a 28% - 32% overlap among the
phrases, approximately 70% of the desired phrases cannot be found
in the document. If, however, identical words are sought, the
severity of the problem is lessened somewhat. For words, some 35%
- 39% of the terms cannot be found in the noun phrases in the
title/abstract. Clearly, these percentages, though smaller, are
still sizeable and they raise a fundamental question about whether
the goal of generating identical phrases/words automatically can
be achieved. A more reasonable goal is to produce surrogate
free-index terms from the title/abstract that have two
characteristics: (1) their occurrence per document is
approximately equal to the number of II terms per document, and
(2) their performance in a retrieval test approximates that of
real II phrase words.

Table 5 also provides an estimate of the task involved.
Since between 38% - 42% of noun phrase words are in common,
approximately 60% of all noun phrase words need to be eliminated.
A similar indication can be found in the statistics of Table 3.
In terms of the average number of items per document, there are
17.34 noun phrases but only 4.91 II phrases. Or, in terms of
words within the phrases, there are over 29 from the noun phrases
but only about 10 from the II phrases.

Selection of Free-Index Phrase Words from Noun Phrases: The first
objective of a selection mechanism is to reduce the noun phrase
vocabulary to a size comparable to that of the free-index
vocabulary. The second objective is to select terms that
contribute to a strong performance in retrieval (i.e. are "good"
indexing terms). Words rather than phrases were sought because
the task may be easier (see Table 5) and, perhaps more importantly,
because the retrieval performance of the II representation was
obtained by searching on free-index words.



To achieve these objectives several commonly used statistical
selection criteria were considered: those based on discrimination
values, those based on postings, and those based on within
document frequencies.



The discrimination value (DV) approach to automatic indexing
has been proposed and studied almost exclusively by Salton and his
colleagues. That approach selects as index terms words that
discriminate by increasing the separation among documents in
n-dimensional space. Several conclusions from the research on
discrimination values are applicable here.



1. Terms in a collection can be ranked according to their
   discrimination values. Those with high DVs are better index
   terms for retrieval than those with DVs near zero. Terms with
   negative DVs are the poorest index terms.

2. There is a non-linear relationship between the DV of a term
   and its document frequency. The presence of this relationship
   is important in a practical sense because computing DVs is
   much more complex and expensive than is computing simple
   document frequencies.

3. To our knowledge, no attempts have been made at computing DVs
   on phrases and evaluating the effectiveness of the selected
   phrases. Salton and Wong (1976) briefly discuss this
   possibility, but use a simpler approach for selecting their
   phrases.



Initially, our goal was to select noun phrases with high DVs.
Each phrase was to be normalized by removing trivial words,
stemming the remaining words and then alphabetizing them so that
word order was not a factor. Shorter phrases wholly contained
within longer phrases within the same document were also
eliminated. A program to compute DVs of normalized phrases was
developed based on the algorithm described by Salton, Wu and Yu
(1981). A test of that program on a sample of the title/abstracts
of 994 documents revealed further support for the relationship
between DVs and document frequencies. A linear relationship of
-.55 was estimated with the Pearson r; presumably the
relationship would be even stronger if a suitable non-linear
transformation were employed. As a result of finding this strong
relationship, we decided not to pursue the use of discrimination
values as a selection criterion and focused on the more easily
obtainable document frequencies and associated statistics.
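
The phrase normalization described above can be sketched as
follows. The trivial-word list and the suffix-stripping stemmer are
hypothetical stand-ins (the report does not name its stemmer in
this passage), and "wholly contained" is read here as a proper
subset of stems, since word order is ignored.

```python
TRIVIAL_WORDS = {"the", "a", "an", "of", "this"}  # hypothetical


def crude_stem(word):
    """Very rough suffix stripper, for illustration only."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def normalize(phrase):
    """Drop trivial words, stem, and alphabetize the stems."""
    stems = {crude_stem(w) for w in phrase.lower().split()
             if w not in TRIVIAL_WORDS}
    return tuple(sorted(stems))


def normalize_document(phrases):
    """Normalize a document's phrases and drop any phrase whose
    stems are a proper subset of a longer phrase's stems."""
    normalized = {normalize(p) for p in phrases}
    normalized.discard(())
    kept = [p for p in normalized
            if not any(set(p) < set(q) for q in normalized)]
    return sorted(kept)


doc = normalize_document(["automatic indexing", "indexing",
                          "the automatic indexing"])
```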

Both document frequencies (DF) and within document
frequencies (WDF) have been extensively studied for several years
(e.g. Salton, 1975; McGill et al., 1979; Sparck Jones, 1973).
The results are not completely clearcut, but appear to depend upon
the database, the type of query, and many other factors in the
retrieval environment. However, many of the studies have
confirmed the value of using the collection frequencies in some
form (either DF or the total number of tokens). Furthermore,
there is some support (e.g. Sparck Jones, 1973) for modifying
document frequencies by the inclusion of within document
frequencies. Consequently, the approaches considered here are all
based on some variant of

                                                    (Equation 1)

where the terms included derive from the noun phrases in the
title/abstract.

Of the several methods considered, three emerged as most
promising. One of these methods was based wholly on the
individual terms in the noun phrases: each word meeting the
criterion was selected as a surrogate II term for the document.
This method will be designated as the "word" method because the
surrogate II terms are selected from the union of terms in all the
noun phrases.

The other two methods make more extensive use of the noun
phrases. Characteristics of the phrase or its component terms are
examined. If the measured characteristic exceeds the criterion,
then the entire phrase is selected as a surrogate II phrase
(though searching will be based on the component words). These
methods will be designated as "phrase" methods. The three methods
are described more completely later in this report.

All three methods operate on stemmed, non-trivial words from
the noun phrases in the title/abstract. For the two phrase
methods, further normalization included removing the effect of
word order within the phrase and eliminating shorter phrases which
were completely contained in longer phrases within the same
document.

The objective was to identify surrogate II terms (or phrases)
which matched the existing II terms/phrases in both number (N) and
document frequency (DF). We did not want to select many more
terms/phrases than there were II's in the document. To do so
would seriously affect the nature of the free-index
representation. We also believed that substantially altering the
document frequencies of the selected terms would affect searcher
behavior and consequently retrieval performance. Each of the
three methods tested various combinations of the parameters to
determine which combination produced surrogate II terms with the
desired statistical properties. In addition, the actual terms
selected were compared with those in the free-index phrases for
each document.

Four values were computed for each combination of the
parameters.

   Number:  The average number of surrogate terms per document.

   Pearson: Pearson r between the number of surrogate terms and
            II terms per document.

   Similar: Similarity (DICE) between surrogate terms and
            II terms per document.

   Overlap: Average percent of II terms also in surrogate
            terms per document.
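
The four values can be illustrated with a short sketch. It assumes
that "Similar" and "Overlap" are averages of per-document figures,
that every document has at least one II term, and that the term
counts vary across documents (otherwise Pearson r is undefined).
Function names and the sample data are hypothetical.

```python
from math import sqrt


def dice(a, b):
    """Dice coefficient between two term sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def evaluate(surrogates, ii_terms):
    """surrogates, ii_terms: dict doc_id -> set of terms."""
    docs = list(ii_terms)
    n_sur = [len(surrogates[d]) for d in docs]
    n_ii = [len(ii_terms[d]) for d in docs]
    return {
        "number": sum(n_sur) / len(docs),
        "pearson": pearson_r(n_sur, n_ii),
        "similar": sum(dice(surrogates[d], ii_terms[d])
                       for d in docs) / len(docs),
        "overlap": sum(100.0 * len(surrogates[d] & ii_terms[d])
                       / len(ii_terms[d]) for d in docs) / len(docs),
    }


# Hypothetical three-document sample.
surrogates = {"d1": {"a", "b"}, "d2": {"c", "d", "e", "f"}, "d3": {"g"}}
ii_terms = {"d1": {"a", "x"}, "d2": {"c", "d", "y", "z"}, "d3": {"g", "h"}}
result = evaluate(surrogates, ii_terms)
```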



To provide some indication of an upper bound on these values,
a fourth method was developed to maximize the overlap of the
selected terms with the II terms. This method simply selected a
noun phrase if it contained at least one term that was also in an
II phrase for the document. The four values resulting from this
selection are given in Table 6.

Table 6

Approximate Upper Limits for Selection Methods*

   Number       Pearson      Similarity      Overlap

   1U827        .8285        .7772           87.31

* Based on a random sample of 994 documents

Thus, 87 percent of the II terms were selected and the
average number of terms per document is close to 9.59, which is
the average number of II word stems per document.

The three methods were then tested against a small collection
of 100 documents. Those combinations of parameters which
performed best were tested again on the larger collection of 994
documents. Parameters that depend upon collection size will have
to be adjusted. The statistical analyses below are based on this
larger database.

1. Word Method: All words in the noun phrases selected by
the parser from each title/abstract were stemmed. Duplicate stems
were eliminated both within a document and across the sample of
documents. The "word version" of equation #1 (i.e. WDF/DF) was
then applied to each term for several values of θ. Each term
above that value was considered a potential surrogate free-index
term for the document it came from. A second parameter, N, was
then used to limit the number of selected surrogate II terms per
document.
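
A minimal sketch of the word method follows. It assumes that WDF is
the count of a stem among a document's noun-phrase words, that DF
is the number of documents containing the stem, and that when N
limits the selection the terms with the highest WDF/DF are kept.
The last point is an assumption, as are the function and parameter
names.

```python
from collections import Counter


def word_method(doc_stems, theta, n_max=None):
    """doc_stems: dict doc_id -> list of stems (repeats kept).
    Returns dict doc_id -> selected surrogate terms."""
    df = Counter()
    for stems in doc_stems.values():
        df.update(set(stems))            # document frequency
    selected = {}
    for doc_id, stems in doc_stems.items():
        wdf = Counter(stems)             # within document frequency
        scored = [(wdf[t] / df[t], t) for t in wdf
                  if wdf[t] / df[t] > theta]
        # Highest ratio first; ties broken alphabetically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        if n_max is not None:
            scored = scored[:n_max]
        selected[doc_id] = [t for _, t in scored]
    return selected


# Hypothetical three-document sample.
docs = {"d1": ["index", "index", "term"],
        "d2": ["term", "phrase"],
        "d3": ["term"]}
strict = word_method(docs, theta=0.5)
capped = word_method(docs, theta=0, n_max=1)
```

Leaving `n_max` unset corresponds to imposing no limit on the
number of terms per document.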

The word method was applied to a random sample of 994
documents for several combinations of θ and N. For each
combination, the four statistical values described earlier were
computed. Table 7 presents the most applicable results.

Table 7

Results of Applying the Word Method

Combin-
ation     N      θ      Number*   Pearson   Similar   Overlap

W1        10     0      9.65      .3206     .4227     47.59

W2        10     .1     7.307     .3904     .3538     33.47

W3        13     .1     8.255     .4059     .3630     35.93

W4        15     .076   9.652     .4250     .3825     40.50

W5        ∞      .1     9.1*32    .3940     .3675     38.17

W6        ∞      .2     6.322     .3417     .3070     27.51

* There are 9.59 free-index terms per average document



Of these six combinations of parameters, W1 and W4 produce
approximately the same number of surrogate II terms per document
as there were actual free-index terms. The other values for these
two combinations are quite different from their estimated upper
limits (see Table 6).

2. Phrase Method #1: This method begins by stemming each
word in all noun phrases found in the title/abstract. Duplicate
phrases within each document are eliminated, as are shorter
phrases which are wholly contained in longer phrases; word order
and trivial words are ignored. Equation #1 is then applied to
each of the resulting normalized phrases. Those phrases whose
values are above the parameter value θ are selected as
potential surrogates for the document from which the phrase
originated. The second parameter, N, was then applied to limit
the total number of surrogate II phrases per document.

Table 8 presents the results of applying Phrase Method #1 to
the sample of 994 documents.

Table 8

Results of Applying Phrase Method #1

Combin-
ation     N      θ      Number    Pearson   Similar   Overlap

P1        5      0      10.474    .3899     .4849     56.30

P2        5      .10    9.951     .4216     .4743     52.97

P3        5      .15    9.560     .4221     .4623     50.60

P4        5      .20    9.157     .4246     .4506     48.23

P5        5      .40    7.603     .4070     .3913     39.23

P6        10     0      17.094    .5272     .5162     75.78

P7        10     .30    10.608    .4300     .4372     49.89

P8        ∞      .40    9.569     .4023     .4060     44.55

Two sets of results (P3 and P8) come closest to matching the
number of actual free-index terms per document. In comparison
with the word method, phrase method #1 seems to perform slightly
better, but these differences may not be more than can be
attributed to chance factors. As in the case of the word method,
the performance of phrase method #1 falls sizeably below the
estimated upper limits shown in Table 6.

A more detailed comparison between the two methods was
carried out. Three indices of similarity were computed:

   Doc. Dice:   Average similarity (DICE) between two sets
                of surrogate II terms, by document.

   Vocab. Dice: Similarity (DICE) between the total
                vocabularies of the two sets of surrogate
                II terms.

   DF:          Pearson r between the document frequencies of
                the common vocabulary of the two sets of
                surrogate II terms.

Table 9 compares the four best combinations (W1, W4, P3, and
P8) in terms of these indices of similarity. The data indicate
that the vocabularies generated by the word and phrase methods are
very similar, but for individual documents the terms assigned are
quite different and the resulting document frequencies are also
different. The figures also show that the similarity is higher
within the two types of methods than between the methods.

Table 9

Similarity Among Selected Methods*

          W1                    W4                    P3

W4   .793V/.9779/.7650

P3   .639v/.9567/.5589    .6329/.9377/.4581

P8   .5497/.9818/.5373    .5753/.9679/.4583    .7582/.9480/.9765

* The three values in each cell are: Doc Dice; Vocab Dice; DF

3. Phrase Method #2: This method begins with a
normalization of terms and phrases selected from the
title/abstract. Individual stemmed words are further considered
if their document frequencies fall within a predetermined range.
The phrases from which these selected word stems originate are
then evaluated using equation #2 (where α, β, and θ are the
parameters).

   αΣl + βΣWDF ≥ θ                                  (Equation 2)

Only two combinations of these parameters produced reasonable
results using the data base of 994 documents.

Table 10

Results of Applying Phrase Method #2

Combin-   DF
ation     range   α   β   θ    Number    Pearson   Similar   Overlap

PX        3-30    2   3   11   11.5^12   .4921     .4631     53.72

PI        1-30    1   2   4    10.484    .4699     .455^     51.33



These results are not very different from those generated by
Phrase Method #1.

In general, the vocabularies produced by the three methods
reveal certain differences, especially with respect to document
frequencies of selected stems. Perhaps even more telling is the
finding that the statistical analyses of the surrogate free-index
terms do not identify any one of the methods as clearly superior
on all measures (cf. Tables 7, 8, and 10). Equally important is
that the highest measures (regardless of the method) are some 38%
- 'MX lower than the estimate of their upper limit (see Table 6).

The best assessment of the performance of each of these
methods, however, does not depend solely on the previous
statistical analyses. These provide, at best, clues to how the
selected surrogate terms will function in a retrieval environment.
Information retrieval theory is not sufficiently developed to
allow us to confidently predict poor retrieval performance from
these figures. Consequently, we need to conduct actual retrieval
tests using these methods and compare the results with those
obtained using the actual free-index terms.
RETRIEVAL TESTS OF SURROGATE FREE-INDEX PHRASES

To test each of the selection methods, we were fortunately
able to make use of the data base, the search queries, and the
relevance judgments used in the Overlap Study. The different
selection methods create different vocabularies of index terms.
The original searches on the free-index phrase representation (II)
needed to be repeated against each of the new vocabularies.
Recall and precision could then be computed for each of the
queries, and the performance of each selection method could be
compared with that of the others and with that of the actual
free-index phrases.
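
Recall and precision for a query, and their macro-averages over a set of queries, can be computed as in the following sketch. It is illustrative only and assumes that retrieved and relevant documents are available as sets of document identifiers; the function names are hypothetical, and returning zero when nothing is retrieved (or nothing is relevant) is a convention adopted here, not one stated in the report.

```python
def recall_precision(retrieved, relevant):
    """Recall and precision of one search.

    recall    = relevant retrieved / all relevant documents
    precision = relevant retrieved / all retrieved documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def macro_average(per_query):
    """Macro-recall and macro-precision: means of per-query values.

    per_query: list of (recall, precision) pairs, one per query.
    """
    n = len(per_query)
    macro_recall = sum(r for r, _ in per_query) / n
    macro_precision = sum(p for _, p in per_query) / n
    return macro_recall, macro_precision
```

Macro-averaging gives every query equal weight regardless of how many documents it retrieves.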

To simplify the task, the 84 queries were examined to see if
any failed to retrieve a single relevant document (judged either
"1" or "2") when searched against the II representation. Seven
queries were thus eliminated. The remaining 77 queries were then
used to identify a database of 4114 documents that were actually
retrieved by the original II searches. Each of the documents
needed to be parsed before the surrogate II representations could
be created. The parser failed to handle 28 of the documents.
Four other documents did not have an abstract and as a result did
not produce any noun phrases. An examination of these 32
documents, the queries that retrieved them, and their relevance
judgments showed no systematic pattern that could be discerned.
Consequently, these documents were dropped from the test
collection. The final retrieval environment used to test the
different selection mechanisms consisted of 77 queries and 4082
documents.

Reexamination of Parameters: It was possible that the various
selection methods identified on a random sample of 994 documents
would behave quite differently on the collection of 4082 retrieved
documents. To consider this possibility, the statistical analyses
were repeated. Table 11 gives the results for the best set of
parameters for each of the three methods and Table 12 gives the
similarity among them.

Table 11

Results of Applying Methods to Retrieved Documents

Method      Parameters      Number*   Pearson   Similar   Overlap

Word        N = «           10.90     .3570     .3301     35.09
            θ = .05

Phrase-1    N = «           io.7tl    .3661     .36^8     ao.22
            θ = .09

Phrase-2    3 ≤ DF ≤ 50     10.09     .3981     .3770     tjo.26
            α = 2
            β = 3
            θ = 11

*In this database there are 10.623 free-index terms per average
document.
Table 12

Similarity Among Methods Given in Table 11*

            Word                  Phrase-1

Phrase-1    .5902/.9955/.i«69a

Phrase-2    .4874/.7282/.4887     .6153/.7348/.9855

*The three values in each cell are: Doc Dice; Vocab Dice; and DF.

The pattern here is similar to that found in Table 9. There
is a greater similarity among the complete vocabularies of the
different methods than there is for each document. Interestingly,
there is more agreement among the two phrase methods than is found
with the word method.

Retrieval Results: The actual free-index phrase representation
was compared with four surrogate representations in terms of
recall and precision. Three of the surrogate representations are
those selected by a statistical examination of alternative
combinations of parameters; the three combinations tested here
are described in Table 11. The fourth representation is provided
for comparison purposes only. It is composed of 100% of the noun
phrases identified manually in the title/abstracts of the
documents.

The 77 queries, originally searched under the II
representation, were resubmitted using that representation (with a
slightly altered database) and using the four surrogate
representations. Recall and precision values for all of these
searches can be found in Appendix E. Descriptive statistical
values (e.g., macro-recall and macro-precision) are also provided
in that appendix.

For each of the four surrogate representations, two types of
analyses were performed. First, the results were considered on a
query-by-query basis to determine the number of queries that
performed better for the surrogate or for the actual free-index
phrases in terms of both recall and precision. Secondly, the
average recall and precision for the surrogate and II were
compared statistically using Student's t procedure for correlated
measures. The results of these analyses are presented in Tables
13 and 14.
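
Student's t for correlated measures works on the per-query differences between the two representations. The sketch below is a minimal illustration, not the computation actually run for the study. It assumes two equal-length lists of per-query values (surrogate and II) in the same query order and uses the sample standard deviation with n - 1 in the denominator; the function name is hypothetical.

```python
from math import sqrt


def paired_t(surrogate, actual):
    """Paired (correlated-measures) t statistic on surrogate minus II.

    Returns (mean difference, standard deviation, standard error, t).
    A negative t means the actual (II) values have the higher mean.
    """
    diffs = [s - a for s, a in zip(surrogate, actual)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean = sum(diffs) / n
    sd = sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    se = sd / sqrt(n)
    if se == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean, sd, se, mean / se
```

The resulting t is referred to Student's distribution with n - 1 degrees of freedom, where n is the number of queries.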

Table 13

Performance by Query: II vs. Surrogate

Surrogate    Measure      II > Surr.   II = Surr.   II < Surr.   Total

All Noun     Recall           20           17           40         77
Phrases      Precision        44           12           21         77

Phrase       Recall           45           27            4         76
Method-1     Precision        36           22           18         76

Word         Recall           47           24            6         77
Method       Precision        38           22           17         77

Phrase       Recall           46           22            8         76
Method-2     Precision        35           17           24         76
Table 13 shows that the actual free-index phrase
representation performed better on more queries than any of the
surrogates. The only exception is the obvious one shown in the
first row of the table: all noun phrases as a representation
perform better on more queries in terms of recall than does the II
representation. It is true that for some queries the various
surrogates performed better than the II representation, and in
terms of precision, the three experimental surrogates performed at
least as well as the actual II representation on roughly half of
the queries (39 to 41 of the 76 or 77 queries in each case).
However, the dominant impression from these data is that the
surrogates do not perform as well as II does on a query-by-query
basis.
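
The query-by-query tally reported in Table 13 amounts to counting, for each measure, the queries on which II scored higher than, the same as, or lower than the surrogate. A minimal sketch follows; the function name and the small tolerance used to treat two values as equal are assumptions made here for illustration.

```python
def tally_by_query(actual, surrogate, tol=1e-9):
    """Count queries where II is above, equal to, or below the surrogate.

    actual, surrogate: equal-length lists of per-query recall
    (or precision) values, in the same query order.
    """
    counts = {"II > Surr.": 0, "II = Surr.": 0, "II < Surr.": 0}
    for a, s in zip(actual, surrogate):
        if abs(a - s) <= tol:
            counts["II = Surr."] += 1
        elif a > s:
            counts["II > Surr."] += 1
        else:
            counts["II < Surr."] += 1
    return counts
```

Such a tally shows only the direction of each difference, not its size.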

What cannot be determined from Table 13 is how much better
(or worse) the representations are. To assess that, the actual
size of the difference in the recall and precision figures has to
be considered. Table 14 compares the representations on that
basis.

These figures support the general impression seen earlier,
viz., with the exception of the "non-surrogate" (all noun phrases),
the three methods considered all perform significantly lower on
recall. The differences on precision, though suggesting a lower
performance by the surrogates, could all be attributable to chance
variation. The overall conclusion seems clear: none of the
approaches tested empirically performs better than the actual
free-index phrases, and in terms of recall, the actual phrases
perform better (often sizeably so) than the surrogates.
Table 14

Comparison of Differences Between Representations

Surrogate                Mean***       Standard    Standard
minus II    Measure      Difference    Deviation   Error       t*

All Noun    Recall          .066         .249        .029       2.316**
Phrases     Precision      -.024         .303        .035       -.684

Phrase      Recall         -.150         .219        .025      -5.936**
Method-1    Precision      -.023         .340        .039      -0.594

Word        Recall         -.148         .273        .031      -4.725**
Method      Precision      -.090         .412        .047      -1.900

Phrase      Recall         -.154         .242        .028      -5.503**
Method-2    Precision      -.035         .324        .037      -0.939

*A negative value of t indicates that the II representation had a
higher mean than the surrogate representation.
**These values of t are statistically significant at the .05 level.
***The II means: recall = 0.28; precision = 0.31.
DISCUSSION

There are several possible causes for these results and they
are not necessarily independent of each other. The first
possibility is mainly procedural. Throughout the investigation a
variety of approximations and limitations had to be accepted. For
example, the parser's performance was not perfect; errors of
omission of nearly nine percent could have had a negative impact
on the effectiveness of the surrogate phrases. Another procedural
approximation exists in the retrieval tests. Several queries had
to be discarded and 32 documents were eliminated from the test
collection because they could not be completely parsed. The
queries and documents not included in the retrieval test were
examined to see if their removal might bias the results. Though
no such bias was evident, it is still possible that small
cumulative effects of these and other approximations could account
for some, if not all, of the final results.

The other possibilities are more substantive. There is, for
example, the underlying assumption that the surrogate
representations should be based initially on naturally occurring
phrases and then searched on the individual words in those
phrases. This assumption was based on an analysis of search logs
to the II representation in the Overlap Study. One clue about the
reasonableness of this assumption can be obtained by comparing the
performance of the two Phrase Method surrogates with the Word
Method surrogate. This is not the best test of the assumption,
but it is the case that the Phrase Methods make more use of
phrases than does the Word Method. Using the data presented in
Appendix E, the two types of methods were compared statistically
and no differences were found. That is, neither Phrase Method
performed better on either recall or precision than the Word
Method. Thus, the general approach taken in generating surrogates
may be questionable.

Another possibility was the choice of surrogates. Several
were considered and these were reduced to the final three (Phrase
Methods #1 and #2, and the Word Method) after a thorough
statistical comparison was conducted of their vocabularies and
that of the actual II phrases. However, it is still true that
many other surrogates could have been used, though information
retrieval theory does not identify any major approaches that were
not considered. Perhaps one or more of the rejected approaches
(e.g., using discrimination values, Poisson distributions, or
syntactic patterns in the text) would have proven more effective.
Only further exploration will tell.

The last alternative seems more plausible, though this is
not to exclude contributions from the other possibilities
discussed above. It seems likely that there was an effect caused
by the "implicit phrases": those found in the free-index phrase
representation which were not found directly in the
title/abstract. Earlier we estimated that these implicit phrases
accounted for 10% of the highly relevant documents retrieved and
roughly 12% of all relevant documents retrieved. Since most of
these implicit phrases derive from the controlled vocabulary
representation, they could have functioned to broaden the II
representation sufficiently to account for some of its performance
in recall.

If the implicit phrases are a very important component of the
free-index phrase representation, then attempts to produce
surrogate phrases automatically will have to incorporate a
thesaurus (as Harding is doing at INSPEC) or make use of
statistical methods to identify broad term classes. Until those
techniques have been developed and tested, it is difficult to
conclude that an automatically generated representation selected
from naturally occurring precoordinated phrases and searched on
their constituent terms is, in general, effective.
Appendix A:
Contents of INSPEC Records 



Each document consisted of a series of bibliographic 
citation fields, the abstract, and some indexing information. 
The format of each document record as it was printed upon 
retrieval is given below. 



INSPEC DNnumber (abstract numbers from INSPEC journals)
Title
Authors (separated by commas)
Source Field: as follows
    Publication: (volume and issue number)
    (part number) pagination data
    following this may be information in ( ).
    This is information on the cover-to-cover
    translation as follows: (publication; (volume
    and issue) pages, (date) (type of unconventional
    media) (availability) (Title of Conference)
    (location of conference) (sponsoring
    organization) (date) language).
Abstract
Indexing Information


Appendix B:
Questionnaire

SYRACUSE UNIVERSITY 



SCHOOL OF INFORMATION STUDIES 

^113 EUCLID AVENUE | SYRACUSE, NEW YORK W21i 
\ 3l5/415-29r 



Dear Mr/Ms: 



We would like to know your response to the questionnaire
enclosed within. These questions relate to the NSF-funded
project, "A Study of the Impact of Representations in Information
Retrieval Systems", undertaken by the School of Information
Studies, Syracuse University in 1981-1982. You took part in the
project as a search intermediary.

Retrieval from seven different document representations
was studied. They included:

DD - Descriptor terms chosen by an indexer from the
thesaurus, a controlled vocabulary.

AA - Free-text words from the abstract; trivial words
excluded.

TT - Free-text words from the title; trivial words
excluded.

II - Free-text phrases chosen by the indexer.

DI - Indexer-selected terms. A compound representation
made up of DD and II.

ST - A stemmed version (automatic suffix removal) of
representation TA.

TA - Free-text terms from the title and abstract. A
compound representation made up of TT and AA.

The data base for the study was Computer and Control
Abstracts (a subfile of INSPEC). The system you were asked to
use was DIATOM.

The objectives of the study required you to conduct
high recall searches, but with a limit of no more than 50
citations per query. In all, you were asked to search 98
queries. Over the course of the study, you used all seven
representations, but for each query, only one representation
was assigned.
For each query, you were asked to search from a request
form; the statement of the query was prepared by a real user
who received the output. The request form also prescribed the 
representation you were to use. The unique password assigned to 
the request automatically "locked" the search so that you could 
only search on the designated parts of the citations. 

Prior to conducting any search, you were required to take 
part in a day-long training session. After that, you were 
required to become familiar with DIATOM and the INSPEC data 
base. You submitted fourteen practise searches. 

Enclosed within, in addition to the questionnaire, are 
copies of the searches you conducted and the thesaurus you 
used. 
QUESTIONNAIRE 1.



Please answer the following questions to the best of your 
ability. If you cannot recall the answers to a question, 
please write — "CANNOT RECALL". 

1. Before the training session of the experiment, was the 
data base, INSPEC, new to you? 



2. Rank the following six data bases according to the 

degree of your familiarity with each (at the time of 
the experiment) . Rank first the one with which you 
are most familiar. 

COMPUTER & CONTROL ABSTRACTS
ERIC
PSYCHOLOGICAL ABSTRACTS
MARC
CA CONDENSATES
MEDLARS



3. In a data base with which you are familiar, are you 
inclined to search on 

a) free-text

or 

b) controlled vocabulary 



4. Given a subject area with which you are familiar, are you 
more inclined to search on 

a) free-text 

or 

b) controlled vocabulary 

5. Rank the seven representations you used in the experiment 
according to how comfortable you felt with each. Rank 
first the representation you felt most comfortable with. 

DD DI 
AA ST 
TT TA 



II 
6(a) In the experiment you were allowed to search only
individual words in the II field. Did you, however,
conceptualize the II's as free-index phrases rather
than as individual words?


(b) How did you distinguish between representation II and
representation TA?


7(a) Did you use the thesaurus in II searches as well as in
DD searches?


(b) Or did you rely solely on the text of the query to suggest
terms for searching on the II field?


8. What differences do you perceive between the II's of
INSPEC and the II's of other data bases?


9. Analysis of the results of the experiment showed that II's
performed better than DD's in both recall and precision.
Can you suggest any reasons why this should have happened?
Questionnaire 2.

Searcher Name                      Date

Interviewer                        Tape No.

Introduction:

Do you mind if I record the interview? It will make it easier for me
to discuss the questions with you and free me from concentrating on
writing down your responses.

Have you had an opportunity to read the description of the original
experiment that was mailed to you?
Do you have any questions about that study?
Have you had a chance to look over your searches?

(IF INTERVIEWEE ANSWERS "NO" TO THE FIRST OR THIRD QUESTIONS ABOVE, TAKE A
FEW MINUTES TO REVIEW THE MATERIALS.)

Please answer the following questions to the best of your ability.
There are no right or wrong answers to the questions - we simply hope to
get your professional insights into the points raised.

1. Before the training sessions of the experiment, was the data base,
INSPEC, new to you?


2. Rank the following six data bases according to the degree of your
familiarity with each (at the time of the experiment). Give the number one
(1) to the one with which you were most familiar.

COMPUTER AND CONTROL ABSTRACTS
ERIC
PSYCHOLOGICAL ABSTRACTS
MARC
CA CONDENSATES
MEDLARS

3. In a data base with which you are familiar, do you have a preference for
one type of representation or search field over another, for example,
controlled vocabulary over free-text?



(IF A PREFERENCE FOR ONE OR THE OTHER IS EXPRESSED, PROBE FOR THE REASON
BEHIND THE PREFERENCE.
IS FAMILIARITY, TRANSLATED INTO COMFORTABLENESS, A KEY FACTOR?
WHAT OTHER FACTORS ARE INVOLVED?)



4. In a subject area with which you are familiar, do you have a preference for 
one type of representation or search field over another? 



(IF A PREFERENCE FOR ONE OR THE OTHER IS EXPRESSED, PROBE FOR THE REASON 
BEHIND THE PREFERENCE. - 
IS FAMILIARITY WITH THE SUBJECT AREA A KEY FACTOR? 
WHAT OTHER FACTORS ARE INVOLVED?) 
The next several questions pertain directly to the searches you
conducted as part of our earlier study. Perhaps it would be helpful to
refer to the project summary, particularly in thinking about the seven
different fields or representations used.


(DRAW INTERVIEWEE'S ATTENTION TO THE DEFINITIONS OF THE REPRESENTATIONS
ON THE PROJECT SUMMARY SHEET.)


5. The seven representations you used in the experiment are described on the
project summary sheet. Rank the representations according to how
comfortable you felt with each. Give the number one (1) to the
representation with which you were most comfortable.


DD, descriptor terms                 II, free-index phrases
AA, free-text words                  DI, indexer-selected terms
    from the abstract                TA, free-text terms from
TT, free-text words                      the title and abstract
    from the title                   ST, a stemmed version of TA


Now I'd like to narrow the focus a bit to look at three of the
representations in particular: descriptors (DD), free-text words from the
title and abstract (TA), and free-index phrases (II). What differences do
you perceive among them in the INSPEC data base?



(REFER TO OBSERVATIONS ON INDIVIDUAL SEARCHES IN DISCUSSING QUESTIONS 6 AND



6.a) Here is a new query. Underline the words you would choose if you were
asked to search a field containing only free-text words (TA).

6.b) Now circle the words you would choose if you were asked to search a field
containing only free-index phrases (II). Of course you can circle terms
you have already underlined.


(COLLECT QUERY, WITH SEARCHER NAME FILLED IN, AND STAPLE TO QUESTIONNAIRE.)


6.c) How do you distinguish between free-index phrases (representation II) and
free-text words from the title/abstract (representation TA)?



Now I'd like to concentrate on the searches you conducted as part of
our earlier study. Copies of three of those searches were mailed to you
for review. Of particular interest are the searches on the free-index
phrase (II) field.

7.a) Describe how you formulated your search on free-index phrases (II).


(PROBES, AS NECESSARY: DID YOU RELY SOLELY ON THE TEXT OF THE QUERY TO
SUGGEST TERMS?

DID YOU BROWSE THROUGH SOME DOCUMENTS TO FIND RELATED TERMS TO USE?
IF SO, WHAT CRITERIA DID YOU USE TO CHOOSE THESE RELATED TERMS?)
7.b) In the experiment the computer searched only individual words in the
free-index phrase (II) field. Did you, however, conceptualize the terms as
free-index phrases rather than individual words?


7.c) We gave you a thesaurus to assist in searching descriptor terms (DD). Did
you also use the thesaurus when you searched free-index phrases (II)?


(IF NO, GO TO QUESTION 8.)

(IF YES, PROBE - HOW DID YOU MAKE USE OF THE THESAURUS WHEN YOU SEARCHED
FREE-INDEX PHRASES (IIs)?)


8. In the original study, we were particularly concerned with two measures of
the retrieval performance of the representations, recall and precision.
The results showed that free-index phrases (IIs) performed well on both
measures.




Recall is the number of relevant documents retrieved by a single field
or representation as a proportion of the total number of relevant
documents in the data base. A high recall search, then, retrieves a
large proportion of the documents in a data base that are relevant to
the query. A low recall search retrieves relatively few of the
relevant documents.


a) Can you suggest some reasons why free-index phrases (IIs) did well in
terms of recall?


b) Can you suggest some reasons why IIs might have performed better than
descriptors (DDs) in terms of recall?


Precision is the number of relevant documents retrieved by a single 
field or representation as a proportion of the total number of 
documents retrieved by that representation. The document citations 
resulting from a high precision search, then, contain relatively few 
irrelevant items. Conversely, a low precision search retrieves a 
greater number of citations that are not relevant to the query. 

c) Can you suggest any reasons why free-index phrases performed so well
in terms of precision?



Another striking result had to do with the unique contribution of the 
different representations. That is, for a given representation, what 
relevant documents did it retrieve that were not retrieved by any 
other representation. 



d) Free-index phrases (IIs) were effective in retrieving otherwise
unretrieved relevant documents. Can you suggest any reasons why this
might have happened?


e) Can you suggest any reasons why IIs might have done better than
free-text words (TA) in retrieving unique documents?

9. Are you familiar with free-index phrases in data bases other than
INSPEC? If so, what differences do you perceive between the IIs of
INSPEC and those of other data bases?



When I return to my office, I'll be going over this questionnaire and the
tape to make sure that I've completely understood your responses. May I 
have your phone number so that I may call you to clarify any points I may 
have missed? 

Phone 



Thank you very much for your time and patience. 
Appendix C
Initial Parser Output


Document was entered one sentence at a time; input is designated
by an asterisk (*) along the left margin. Output consists of
words from each sentence identified according to possible grammatical
class.



ENTER NEW SENTENCE. END WITH A PERIOD
*MARKET UNCERTAINTIES AND INFORMATION SEARCH-A STOPPING RULE.
'NAP(NA VB MARKET),
'VBP(NP VB UNCERTAINTIES),
(CJ AND)
'NAP(NA INFORMATION)(NA SEARCH-A),
$PTP(PA STOPPING)'NAP(NA VB RULE), £


ENTER NEW SENTENCE. END WITH A PERIOD
*CONSIDERS THE QUESTION OF COST-BENEFIT ANALYSIS ON A PUBLIC
*INFORMATION SYSTEM WHICH IS DESIGNED TO REDUCE UNCERTAINTIES FOR ECONOMICAL
*AGENTS.
'NAP(NP VB CONSIDERS),
'NAP(AR THE)(NA QUESTION),
$PRP(PR OF)'NAP(NA COST-BENEFIT)(NA VB ANALYSIS), £
$PRP(AV PR ON)'NAP(AR A)(NA PUBLIC)(NA INFORMATION)(NA SYSTEM), £
'NAP(AJ PN QUAL WHICH),
'VBP(VB AX SX IS)(PV PP PTl DESIGNED),
$NFP'NFP(AV PR TO)(NA VB REDUCE), 'NAP(NP VB UNCERTAINTIES), £
$PRP(PR FOR)'NAP(NA ECONOMICAL)(NP VB AGENTS), £


ENTER NEW SENTENCE. END WITH A PERIOD
*THE AUTHOR USES AN ARROW-DEBREU MODEL, TOGETHER WITH INFORMATION
*MEASURES SIMILAR TO THE ONES USED IN CLASSICAL INFORMATION THEORY.
'NAP(AR THE)(NA AUTHOR)(NP VB USES),
'NAP(AR AN)(NA ARROW-DEBREU)(NA VB MODEL),
(PU ,)
(AV TOGETHER)
$PRP(PR WITH)'NAP(NA INFORMATION)(NP VB MEASURES)(NA VB SIMILAR), £
$PRP(AV PR TO)'NAP(AR THE)(NP ONES), £
'VBP(PV PP USED),
$PRP(AV PR IN)'NAP(NA CLASSICAL)(NA INFORMATION)(NA VB THEORY) £
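
Output in the form shown above can be read mechanically. The sketch below is illustrative only and rests on assumptions about the notation that the appendix does not spell out: that a label such as NAP preceded by an apostrophe opens a phrase, that each parenthesized group lists one or more class codes followed by the word itself, and that NAP marks the noun phrases used in this project. The function names are hypothetical.

```python
import re

# One parenthesized group, e.g. "(NA VB MARKET)".
GROUP = re.compile(r"\(([^()]*)\)")
# An apostrophe, a phrase label, then one or more groups.
PHRASE = re.compile(r"'([A-Z]+)((?:\([^()]*\))+)")


def phrases(line):
    """Return (label, words) pairs for the phrases on one output line."""
    found = []
    for label, groups in PHRASE.findall(line):
        # Assumption: the last token in each group is the word,
        # the preceding tokens are possible grammatical classes.
        words = [g.split()[-1] for g in GROUP.findall(groups) if g.split()]
        found.append((label, words))
    return found


def noun_phrases(lines):
    """Collect the word strings of all phrases labelled NAP."""
    return [
        " ".join(words)
        for line in lines
        for label, words in phrases(line)
        if label == "NAP"
    ]
```

Applied to the second sentence of the sample, this would yield strings such as "A PUBLIC INFORMATION SYSTEM", which could then be passed to a selection method.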
Appendix D
Part 1: Single Word Phrases 



1 
2 
3 
4 
5 
6 
7 
8 
9 
0 
A 

ACCORDING 

ACCURACY 

ACCURATE 

ADAPTABLE 

ADVANTAGEOUS 

AGE 

ALL 

APPLICABLE
APPROACH 
ARTICLE 
AS 

ASPECT 

ATTRACTIVE

AUTHOR 

AUTHORS 

AWARENESS 

B 

BASIS 

BELONG 

BLOCK 



C 

CASE 

CASES 

CLASS 

COMMENDABLE 

COMPUTES

CONCEPT 

CONJUNCTION 

CONSIDERABLE 

CONSIDERATION 

CONTEMPORARY 

D 

DT

DATA 

DEALS

DEPENDENT 

DETAIL 

DISCRETE 

DISCUSSION 

DOES 

DYNAMIC 

E 

EACH 
EIGHT 
EITHER 
ENOUGH
ERA 

ESTIMATE 
ETC. 
EXAMPLE 
EXAMPLES 



EXIST 

F

FALL 

FASHION 

FAVOUR 

FEATURES 

FIVE 

FOUR 

FUNCTIONAL 
G

GENERAL 

GIVE 

H 

HE 

HOW 

I 

IBID 
IDEA 

IDEAL

IF 

ILLUSTRATE 

IMPORTANT 

INFLUENCE

INTEREST 

ISOLATES 

IT 

J 

K 

KIND 
L 

LARGE 
LINES

LOOK 

M 

MEDICAL 

MENTION 

METHOD 

MODULAR 

MORE 

MOVE 

MOVES 

MUCH 

N 

NEWEST 

NINE 

O

OFFERS 
ONE 
OTHER 
P 

PT 

PAIRS 
PAPER 
PART 

PARTICULAR 
PARTS 
PRAGMATIC
PRELIMINARY
PREVALANCE 
PRINTING 



PROBLEMS 
PROCEDURAL 
PROCESS 
POSSIBLE 
POSSIBILITY

Q 
R 

RECENT 

REDUCES 

REGARD 

REMARKS 

REST 

RESULT 

RESULTS 

REVIEW 

S 

SEVEN
SHOW 
SIDES 
SIMPLE 
SIX 

SOLVABLE 
SOLVING 

SOME

STUDIES 
STUDY 

SUCH

SUITABLE 
T 

TECHNICAL 



TERMS 

THAT 

THEM 

THERE 

THESE 

THIS 

THOSE 

THREE 

THUS 

TOO 

TRANSIT 

TWO 

U 

USE 

USES 

UNIVERSAL 
V 

VIEW 
W 

WAYS 

WHEN 

WHERE 

WHICH 

WHILST 

WHO 

X 

Y 

Z 

ZERO 
Part 2: Initial Word of Multi-Word Phrases


A               KEEP            SOMETIMES
ALL             MANY            SPECIAL
AN              MEASURING       STRAIGHTFORWARD
ANY             MORE            STUDIES
AS              MOVE            STUDYING
AUTHOR          MINIMIZE        SUBSTANTIAL
AUTHORS         NO              THAT
BOTH            ONLY            THE
CONSIDERABLE    OWN             THEIR
DEVELOPED       PART            THERE
DT              PARTICULAR      THESE
EACH            PAST            THIS
EVERY           POSSIBLE        TO
EXACT           PRESENT         TYPICAL
FURTHER         RELATED         USING
GIVEN           RESULT          USUAL
GIVES           RESULTS         VARIOUS
HIS             SAME            VERY
IS              SEE             WHEN
ITS             SOME            WHERE
                                WHICH
APPENDIX E

Recall and Precision of Surrogate and
Actual Free-Index Representations

E-1: Surrogate: All Noun Phrases

             Recall                   Precision
Query   Surrogate   Free-Index   Surrogate   Free-Index


101 


0.10 


0.23 


0.16 


0*28 


102 


0.00 


0.57 


0»00 


0 • 80 


103 


0.45 


0.33 


0*44 


0*44 


104 


0.30 


0.50 


0*22 


0.6A 


105 


0.17 


0.21 


0*16 


0.45 


106 


0.50 


0.38 


0.67 


1*00 


107 


0.42 . 


0.25 


0.19 


0.67 


108 


0.39 


0.22 


0.19 


0.24 


109 


0.49 


0.18-^ 


0.A9 


0.38 


110 


0.95', 


0.89 


0.2Z 


0.35 


111 


0.00 


0.0.0 


0.00 . 


0.00 


112 


0.40 


0.33 


0.2Z 


0.45 


113 


0.67 


0.56 


O.ZB 


0. SO 


114 


0.12 


0.35 


0.09 


0. 16 


115 


0.33 


0.44 


0.A7 


0.57 


116 


0.00 


0.00 


0.00 


0. 00 


117 


0.00 


0.00 


0.00 


0. 00 


118 


0.50 


1 .00 \ 


0.06 


0.23 


119 


0.30 


0.50 


0.75 


0.83 


120 


0.00 


0.00 


0.00 


0. 00 


121 


0.80 


0.64 


0.2A 


0* 55 


122 


0 .86 


0.57 


0.07 


0. 09 


123 


0.00 


0.00 


0.00 


0. 00 


124 


_ 0.21 


0. 10 


0.6A 


0. 88 


125 


0.23 


0.38 


0*38 


0« 83 


126 


0.38 


0.20 


0«56 


0.5Z 


127 


0.56 


0.00 


©•20 


0. 00 


128 


0. 19 


0.38 


0 •25 


0. 23 


129 


' 0.00 


0.00 


0#00 


0. 00 


130 


0.35 


0.54 


0.6Z 


0. 92 


131 


0.10 


0.10 


0 .25 




133 


0.57 


0.Z6 


0.26 


0. 53 


135 


0.45 


0.55 


0.69 


0.78 


136 


0.56 


0.29 


0.24 


0.24 


137 


0.10 


0.40 


0.33 


1.00 


138 


0 .33 


0.33 


0.04 


0.07 


139 


0.32 


0.13 


0.16 


0. 13 


140 


0 .39 


0.28 


0.41 


0.68 


141 


0.61 


0.56 


0.22 


0.24 




0.05 


0.00 


0.20 


O.OO 
             Recall                   Precision
Query   Surrogate   Free-Index   Surrogate   Free-Index




A A A 




A A A 

0 « 00 


A / iy\ 


148 


0* 60 


0*80 


0# 18 


0*29 


149 


0* 43 


0«00 


0«03 


^ ^ y\ 

0^00 


1 


A < T 
0 * 13 


0« 13 


A OA 


n o«c 


1 Ul 




o ^o 


o no 


n on 




A OA 


O OA 


o on 


n 1 *^ 

u « Xu 




1 .00 






0*09 


1 Ril 


1 » 00 


o.oo 


1 .00 


0 . 00 






O 


1 no 

X « Vf u 


o ^ no 


1 




Oil 


V « WW 


o 'xn 


1 «J/ 


1 


o no 


O Af\ 


n no 


IDO 




O 




n nx 


1 c^O 
1 Ot 




O A7 


O OO 








o no 


o no 


n nn 


16J 


A *7 O 

0 » 32 


A TO 

0*32 


A 1 X 


A 07 

0 « 23 


163 


A A A 
0 » 00 


A A A 
0 * OO 


A An 


n AO 


164 


A *7 O 

0 » /y 


A A A 
0 • \J\J 


A VO 

0 « /y 


A A A 


165 


yv A cr 

0» 45 


0*52 


A 4 O 

0 « lo 


A y) A 

0 ^44 


4 y y 

1 66 


A T A 
0 • 30 


A O O 


A i "7 
0 • X / 




1 6/ 


A T Z 

0 • 36 


A OO 


U « J>3 


n c:7 


168 


A i "7 
0 • 1 / 


A i i 
0*11 


A AO 

u « oy 


A i 1 

U ♦ X X 


1 A9 


1 * 00 


1 .00 


1 *.oo 


0.20 


1 7n 
i / U 


0 * 09 


o . oo 


0 . so 


o ^ no 


171       0.90          0.19          0.31          0.49
172       [illegible]   0.00          0.17          0.00
173       0.14          0.00          0.45          0.00
174       0.20          0.10          0.29          0.50
175       0.56          0.52          0.91          0.97
176       1.00          1.00          0.10          0.12
177       0.17          0.00          0.33          0.00
178       [illegible]   [illegible]   [illegible]   0.34
179       0.86          0.71          0.29          0.23
180       0.04          0.00          0.06          0.00
181       0.10          0.00          0.33          0.00
182       0.47          0.48          0.59          0.91
183       0.56          0.17          0.07          0.38
184       0.38          0.44          0.43          0.54

mean      0.35          0.28          0.29          0.31
median    0.32          0.22          0.22          0.24
std dev.  0.29          0.27          0.26          0.30

APPENDIX E

Recall and Precision of Surrogate and
Actual Free-Index Representations

E-2: Surrogate: Phrase Method #1

          Recall                      Precision
Query     Surrogate     Free-Index    Surrogate     Free-Index
101       0.00          0.23          0.00          0.28
102       0.00          0.57          0.00          0.80
103       0.12          0.33          [illegible]   [illegible]
104       0.02          0.50          0.25          0.64
105       0.04          0.21          [illegible]   [illegible]
106       0.25          0.38          1.00          1.00
107       0.08          0.25          0.40          0.67
108       0.00          0.22          0.00          [illegible]
109       0.04          0.18          0.40          [illegible]
110       0.53          0.89          0.34          0.35
111       0.00          0.00          0.00          0.00
112       0.07          0.33          0.33          [illegible]
113       0.56          0.56          0.50          0.50
114       0.00          0.35          0.00          [illegible]
115       0.04          0.44          0.33          [illegible]
116       0.00          0.00          0.00          [illegible]
117       0.00          0.00          [illegible]   [illegible]
118       0.33          1.00          [illegible]   [illegible]
119       0.30          0.50          1.00          [illegible]
120       0.00          0.00          [illegible]   [illegible]
121       0.16          [illegible]   [illegible]   0.55
122       0.14          0.57          [illegible]   [illegible]
123       0.00          0.00          0.00          0.00
124       0.00          0.10          0.00          0.88
125       0.05          0.38          0.75          0.85
126       0.00          0.20          0.00          0.53
127       0.00          0.00          0.00          0.00
128       0.06          0.38          1.00          0.23
129       0.00          0.00          0.00          0.00
130       0.20          0.54          0.94          0.92
131       0.00          0.10          0.00          0.50
133       0.29          0.36          0.67          0.33
135       0.02          0.55          0.50          0.78
136       0.07          0.29          0.14          0.24
137       0.33          0.33          1.00          1.00
138       0.33          0.33          0.17          0.07
139       0.08          0.13          0.33          0.13
140       0.00          0.28          0.00          0.68
141       0.22          [illegible]   0.22          0.24
142       0.00          0.00          0.00          0.00

E-2: Surrogate: Phrase Method #1

          Recall                      Precision
Query     Surrogate*    Free-Index    Surrogate     Free-Index
147       0.00          0.00          0.00          0.00
148       0.00          0.80          0.00          0.29
149      "1.00          0.00          0.10          0.00
150       0.13          0.13          0.20          0.25
151       0.00          0.00          0.00          0.00
152       0.00          0.20          0.00          0.15
153       0.67          0.67          0.09          0.09
154       0.00          0.00          0.00          0.00
155       0.00          0.00          0.00          0.00
156       0.04          0.11          1.00          0.30
157       0.50          0.00          0.50          0.00
158       0.00          0.05          0.00          0.06
159       0.20          0.47          0.30          0.35
160       0.00          0.00          0.00          0.00
162       0.05          0.32          0.07          0.23
163       0.00          0.00          0.00          0.00
164       0.50          0.00          0.88          0.00
165       0.10          0.52          0.18          0.44
166       0.00          0.22          0.00          0.38
167       0.14          0.29          0.40          0.57
168       0.06          0.11          0.09          0.11
169       1.00          1.00          1.00          0.20
170       0.00          0.00          0.00          0.00
171       0.00          0.19          0.00          0.42
172       0.00          0.00          0.00          0.00
173       0.00          0.00          0.00          0.00
174       0.10          0.10          1.00          0.50
175       0.31          0.52          0.89          0.97
176       1.00          1.00          0.67          0.12
177       0.17          0.00          0.50          0.00
178       0.00          0.45          0.00          0.34
179       0.71          0.71          [illegible]   0.23
180       0.00          0.00          0.00          0.00
181       0.10          0.00          0.67          0.00
182       0.09          0.48          0.46          0.91
183       0.00          0.17          0.00          0.38
184       0.02          0.44          0.25          0.54

mean      0.13          0.28          0.29          0.31
median    0.04          0.22          0.14          0.24
std dev.  0.22          0.27          0.35          0.30

*The recall for query 149 is missing

APPENDIX E

Recall and Precision of Surrogate and
Actual Free-Index Representations

E-3: Surrogate: Word Method

          Recall                      Precision
Query     Surrogate     Free-Index    Surrogate     Free-Index
101       0.00          0.23          0.00          0.28
102       0.00          0.57          0.00          0.80
103       0.31          0.33          0.56          0.44
104       0.00          0.50          0.00          0.64
105       0.00          0.21          0.00          0.45
106       0.25          0.38          1.00          1.00
107       0.00          0.25          0.00          0.67
108       0.06          0.22          1.00          0.24
109       0.02          0.18          0.25          0.38
110       0.79          0.89          0.43          0.35
111       0.00          0.00          0.00          0.00
112       0.00          0.33          0.00          0.45
113       0.22          0.56          0.50          0.50
114       0.00          0.35          0.00          0.16
115       0.00          0.44          0.00          0.57
116       0.00          0.00          0.00          0.00
117       0.00          0.00          0.00          0.00
118       0.33          1.00          0.20          0.23
119       0.00          0.50          1.00          0.83
120       0.00          0.00          0.00          0.00
121       0.24          0.64          0.55          0.55
122       0.00          0.57          0.00          0.09
123       0.00          0.00          0.00          0.00
124       0.00          0.10          0.00          0.88
125       0.00          0.38          0.00          0.83
126       0.00          0.20          0.00          0.53
127       0.00          0.00          0.00          0.00
128       0.06          0.38          0.50          0.23
129       0.00          0.00          0.00          0.00
130       [illegible]   [illegible]   0.96          0.92
131       0.00          0.10          0.00          0.50
133       0.21          0.36          0.50          0.33
135       0.00          0.55          0.00          0.78
136       0.15          0.29          0.43          0.24
137       0.00          0.40          0.00          1.00
138       0.00          0.33          0.00          0.07
139       0.00          0.13          0.00          0.13
140       0.00          0.23          0.00          0.68
141       0.61          0.56          0.22          0.24
142       0.00          0.00          0.00          0.00

          Recall                      Precision
Query     Surrogate     Free-Index    Surrogate     Free-Index
147       0.00          0.00          0.00          0.00
148       0.00          0.80          0.00          0.29
149       0.00          0.00          0.00          0.00
150       0.13          0.13          0.20          0.25
151       0.00          0.00          0.00          0.00
152       0.00          0.20          0.00          0.15
153       0.67          0.67          0.11          0.09
154       1.00          0.00          1.00          0.00
155       0.00          0.00          0.00          0.00
156       0.00          0.11          0.00          0.30
157       0.50          0.00          1.00          0.00
158       0.00          0.05          0.00          0.06
159       0.13          0.47          0.33          0.35
160       0.00          0.00          0.00          0.00
162       0.09          0.32          0.50          0.23
163       0.00          0.00          0.00          0.00
164       0.79          0.00          0.79          0.00
165       0.00          0.52          0.00          0.38
166       0.00          0.22          0.00          0.38
167       0.00          0.29          0.00          0.57
168       0.00          0.11          0.00          0.11
169       1.00          1.00          1.00          0.20
170       0.00          0.00          0.00          0.00
171       0.00          0.19          0.00          0.42
172       0.00          0.00          0.00          0.00
173       0.11          0.00          0.40          0.00
174       0.00          0.10          0.00          0.50
175       0.56          0.52          0.91          0.97
176       1.00          1.00          1.00          0.12
177       0.00          0.00          0.00          0.00
178       0.00          0.45          0.00          0.34
179       0.71          0.71          0.83          0.23
180       0.00          [illegible]   0.00          0.00
181       0.00          0.00          0.00          0.00
182       0.09          0.48          0.67          0.91
183       0.00          0.17          0.00          0.38
184       0.00          0.44          0.00          0.54

mean      0.13          0.28          0.22          0.31
median    0.00          0.22          0.00          0.24
std dev.  0.26          0.27          0.35          0.30

APPENDIX E

Recall and Precision of Surrogate and
Actual Free-Index Representations

E-4: Surrogate: Phrase Method #2

          Recall                      Precision
Query     Surrogate     Free-Index    Surrogate     Free-Index
101       [illegible]   [illegible]   0.00          0.28
102       0.00          [illegible]   0.00          0.80
103       [illegible]   [illegible]   0.39          0.44
104       [illegible]   [illegible]   0.38          0.64
105       0.04          [illegible]   0.50          0.45
106       [illegible]   [illegible]   [illegible]   [illegible]
107       [illegible]   [illegible]   0.33          0.67
108       0.06          [illegible]   [illegible]   0.24
109       [illegible]   [illegible]   [illegible]   [illegible]
110       [illegible]   [illegible]   0.32          0.35
111       0.00          [illegible]   0.00          0.00
112       [illegible]   [illegible]   [illegible]   0.45
113       0.67          [illegible]   [illegible]   [illegible]
114       0.00          [illegible]   [illegible]   [illegible]
115       [illegible]   [illegible]   [illegible]   0.57
116       [illegible]   0.00          0.00          0.00
117       0.00          0.00          [illegible]   0.00
118       [illegible]   1.00          0.33          0.33
119       [illegible]   [illegible]   1.00          [illegible]
120       0.00          [illegible]   [illegible]   [illegible]
121       [illegible]   [illegible]   [illegible]   0.55
122       [illegible]   [illegible]   [illegible]   0.09
123       0.00          [illegible]   [illegible]   0.00
124       [illegible]   0.10          0.00          0.88
125       0.05          0.38          0.60          0.83
126       0.10          0.20          0.67          0.53
127       0.00          0.00          0.00          0.00
128       0.06          0.38          1.00          0.23
129       0.00          0.00          0.00          0.00
130       0.27          0.54          0.85          0.92
131       0.00          0.10          0.00          0.50
133       0.29          0.36          0.44          0.33
135       0.04          0.55          0.67          0.78
136       0.27          0.29          0.31          0.24
137       0.33          0.33          1.00          1.00
138       0.00          0.33          0.00          0.07
139       0.13          0.13          0.45          0.13
140       0.00          0.28          0.00          0.68
141       0.56          0.56          0.29          0.24
142       0.02          0.00          0.50          0.00

E-4: Surrogate: Phrase Method #2

          Recall                      Precision
Query     Surrogate*    Free-Index    Surrogate     Free-Index
147       0.00          0.00          0.00          0.00
148       0.00          0.80          0.00          [illegible]
149      "1.00          0.00          0.06          0.00
150       0.13          0.13          0.20          [illegible]
151       0.00          0.00          0.00          [illegible]
152       0.10          0.20          [illegible]   [illegible]
153       0.67          0.67          [illegible]   [illegible]
154       0.00          0.00          0.00          0.00
155       0.00          0.00          0.00          [illegible]
156       0.00          0.11          0.00          [illegible]
157       0.50          0.00          0.50          0.00
158       0.00          0.05          0.00          0.06
159       0.13          0.47          0.22          [illegible]
160       0.00          0.00          [illegible]   0.00
162       0.05          0.32          0.07          0.23
163       0.00          0.00          0.00          [illegible]
164       0.57          0.00          0.72          0.00
165       0.16          [illegible]   0.24          [illegible]
166       0.04          0.22          [illegible]   [illegible]
167       0.14          0.29          0.33          [illegible]
168       0.11          0.11          [illegible]   [illegible]
169       0.00          [illegible]   [illegible]   0.90
170       0.00          0.00          0.00          [illegible]
171       0.00          0.19          [illegible]   [illegible]
172       0.08          0.00          [illegible]   [illegible]
173       0.05          0.00          0.40          0.00
174       0.00          0.10          0.00          0.50
175       0.33          0.52          0.95          0.97
176       1.00          1.00          0.40          0.12
177       0.17          0.00          1.00          0.00
178       0.03          0.45          0.33          0.34
179       0.43          0.71          0.43          0.23
180       0.00          0.00          0.00          0.00
181       0.10          0.00          0.33          0.00
182       0.11          0.48          0.70          0.91
183       0.00          0.17          0.00          0.38
184       0.04          0.44          0.25          [illegible]

mean      0.13          0.28          0.28          0.31
median    0.05          0.22          0.24          0.24
std dev.  0.19          0.27          0.30          0.30

*The recall for query 149 is missing.